SUMMARY


This document contains a series of papers presented at a Department of Defense/Educational
Testing Service conference held in San Diego, California, in March 1987. As such, it describes
ongoing research and development (R&D) within the Air Force Human Resources Laboratory's Job
Performance Measurement Project. Papers on test content selection, work sample testing,
predictive efficiency of the Armed Services Vocational Aptitude Battery, training evaluation, and
transfer of these technologies to users within the DoD user community demonstrate both the
breadth of R&D and the broad applicability of this performance measurement technology.


PREFACE


Active research and development in the area of job performance began in the early
1980s across the four Services in response to a Congressional mandate and requests from
the Services' user communities. In March 1987, the Department of Defense and the
Educational Testing Service hosted a conference on Job Performance Measurement
Technologies to highlight Service efforts and to serve as a forum for discussion among
the Services themselves and public and private agencies engaged in similar activities.
The papers presented here describe the Air Force Job Performance Measurement effort.


JOB PERFORMANCE MEASUREMENT:
TOPICS IN THE PERFORMANCE MEASUREMENT
OF AIR FORCE ENLISTED PERSONNEL


I. TEST CONTENT SELECTION

M. Suzanne Lipscomb
Air Force Human Resources Laboratory

Terry L. Dickinson
Old Dominion University

In the development of any test, it is rarely possible to construct and administer items which
completely exhaust the domain of content to be measured. Time and other practical considerations
constrain what can be covered in any given testing situation. Thus, unless the content to be
measured is very narrowly defined, it is necessary to rely on samples to generalize to the domain
and, ultimately, the universe. This is particularly true in the case of hands-on work sample
testing, where the number of tasks that can be covered is restricted.

The quality of the generalizations or inferences made from test scores is directly related to
the quality of the definition and the sampling of the domain. In order to make valid
generalizations, the domain must be well defined and the sampling must be relevant and
representative. This requirement for the test to represent the larger domain also extends to the
selection of types of items, item quality, and the administration and scoring procedures used.
The critical question becomes, "Does a person's score on this test reflect his/her standing on
the entire domain of interest?" This, in turn, leads to the question of test validity.

By definition, the validity of a test concerns how well the test measures what it purports to
measure (Allen & Yen, 1979; Anastasi, 1982). Thus, validity refers to the accuracy of predictions
or inferences made from test scores, taking into consideration the particular use of the test
(Cronbach, 1971).

Though there have been calls for a more unified view of the validation process (Landy, 1986),
procedures for determining the validity of a test are often classified within three general
categories: content, criterion-related, and construct validity. These three types of validation
procedures are interrelated, with each addressing a specific aspect of the test and the
interpretation of scores on the test. Broadly defined, content validity refers to the extent to
which the content of the test represents the behavioral domain to be measured. Criterion-related
validity reflects the effectiveness of a test in predicting a person's behavior in a specified
situation, either concurrently with the test or in the future. Construct validity is concerned
with the extent to which a test measures a theoretical construct or trait. It is evaluated by
investigating the degree to which certain explanatory concepts account for performance on the
test (Cronbach, 1971).

Quality tests are constructed with the goal of validity in mind. It is the aim of the test
constructor to develop a test which measures what he/she has set out to measure, whether it is a
trait, aptitude, or achievement level. The first step in the test development process is the
specification of what is to be measured. This can be viewed as a process of identifying the
performance universe which encompasses all possible behaviors relevant to the measurement goal.
The content domain identifies a portion of the content universe for the purposes of testing. The
test content universe, which can then be specified at least theoretically, consists of all
possible test items that can be developed for the content domain, as well as conditions of
testing and scoring procedures. From this test content universe, a sample of items is taken
which defines the actual specifications for test construction. This constitutes the test content
domain from which the test is constructed (Guion, 1979).

It is this process of specifying the sample of what is to be measured and how it is to be
measured that ultimately determines the validity of the measure. Although this process is
usually the focus of content validity evaluation, it also has direct impact on the
criterion-related and construct validity of the test. Misspecification of any of the areas of
interest from the content universe to the test content universe, or an inappropriate sampling
procedure, may result in the measurement of something other than what was intended.

It is this process of domain specification and sampling that is the focus of this paper. The
strategy used to develop Air Force Walk-Through Performance Tests will be described, and issues
relating to the use of such a strategy will be discussed.


THE AIR FORCE DOMAIN SPECIFICATION AND SAMPLING PLAN

When developing task-based job performance measures, it is impractical to assess performance
on the universe of tasks within most Air Force specialties (AFSs). No individual performs all of
the tasks in any specialty, and no individual performs an average job in most specialties.
Rather, the tasks of a specialty are distributed by management action to individuals in
consistent ways so as to cluster into a variety of types of jobs. These clusters are based on
the co-performance of tasks and the variations in mission, equipment, or management in any given
locale. This variance of jobs within AFSs is an exceedingly important phenomenon, as it affects
how the specialty is organized in the personnel system, the aptitudes required, the training
provided, and the way individuals can be utilized in the workplace (Mitchell & Driskill, 1979).
This variance in Air Force jobs is of concern to Air Force managers and is one of the major
issues of study in the occupational analysis program (AF Regulation 35-2, Occupational Analysis
Program). Data on most AFSs indicate that the classification structure of the Air Force is
highly dynamic, with frequent reallocation of tasks among specialties. In addition, while there
may be some common tasks performed by a majority of individuals in a specialty, most of the tasks
are performed only by members of the various job types within the specialty.

It is necessary, therefore, to rely on samples of performance that are both useful for
differentiating between good and poor performers and representative of the performance domain.
Differentiating good and poor performers can be accomplished by assessing job incumbent
performance on tasks with a range of difficulty. In addition to identifying tasks with a range
of difficulty, selecting tasks that adequately represent the total specialty domain is necessary
to make inferences about performance from a sample of specific tasks. If the specialty domain is
adequately represented by the tasks selected, the task-based measurement system can be considered
content valid. Unlike other aspects of test validity, the content validity of a measurement
procedure is not a correlational process but an evaluation of adequacy and representativeness
using rational judgments. Lennon (1956) stated that three assumptions underlie the use of
content validity: (a) the area of concern to the user can be conceived as a meaningful,
definable universe; (b) a sample can be drawn from the universe in some purposeful, meaningful
fashion; and (c) the sample and the sampling process can be defined with sufficient precision to
enable the user to judge how adequately the sample typifies the universe.

Given the information available in the Air Force Occupational Research Data Base, these three
assumptions can be met; thus, the issue of content validity can be addressed. The universe can
be defined as the set of tasks for an AFS as detailed by the occupational survey task list. The
sample can be drawn in a meaningful fashion based on the task-level occupational survey data
available. Finally, the sampling process can be defined with precision using a task sampling
plan which will allow a judgment to be made as to the adequacy of the sample.

The task sampling plan consists of a procedural set of guidelines which: (a) specify the
task clusters (i.e., jobs) of interest, (b) establish the level of measurement specificity, and
(c) determine the proportional weighting (importance) of the work activities identified. These
guidelines assure objectivity, replicability, and comparability of efforts to develop measures
which detect meaningful differences in performance, and their use is illustrated below for the
first AFS investigated by the Air Force for the Joint-Service Job Performance Measurement
project, Jet Engine Mechanic (AFS 426X2).


Task Selection Procedural Guidelines

Defining the Job Domain

For most AFSs, the Air Force has a wealth of information sources which give a comprehensive
picture of the work domain; for example, AFS entrance requirements and a general specialty
description (AFR 39-1, Airman Classification Regulation); AFS training requirements (AFM 50-5,
USAF Formal Schools Catalog and Specialty Training Standards); and occupational survey data.

Occupational survey data include: the percentage of incumbents who report performing each
task, the amount of time incumbents report performing the task relative to other tasks,
subject-matter experts' (SMEs) judgments of the relative time required to learn to perform tasks
(i.e., task difficulty), and the relative importance of training for each task (i.e., training
emphasis). Occupational survey data and training information, which cover the full scope of
tasks performed by incumbents in an AFS, were applied to the development of the task-based
performance measures. Because occupational survey data are the most detailed and comprehensive,
they were used to define the work domain. The other sources provided complementary information.

The goals of the Air Force Job Performance Measurement Project are to assess specific job
competencies required within a specialty, and general competencies applicable across AFSs. These
two types of measures require four levels of measurement specificity: Air Force-wide,
specialty-wide, duty-core, and incumbent-unique measures. Since the focus of this paper is on
selecting tasks required to measure individuals' competence within an AFS, only the latter three
levels of measurement specificity are described.

To include an adequate representation for each of these three levels of measurement
specificity, tasks within an AFS must be categorized accordingly. That is, tasks must be
categorized into those performed throughout the specialty (i.e., specialty-wide), those specific
to certain duties within an AFS (i.e., duty-core), and those uniquely performed by incumbents in
certain job types (i.e., incumbent-unique).

The occupational survey task inventory was used to define the work domain and categorize
tasks. Because task performance is often specific to equipment or work centers, tasks associated
with equipment or work centers were used to identify the duty-core domain. Finally, tasks
associated with specific job types defined by the occupational analysis were used to delineate
the incumbent-unique domain. Since it would be impractical to cover adequately all duty areas
and job types within heterogeneous AFSs, those most representative of the work performed were
selected. That is, duty areas and job types which have the largest percentage of personnel were
chosen.


Selecting Tasks Representative of the Job Domain

The procedures for sampling tasks representative of the three levels of measurement
specificity are outlined in the following paragraphs, along with the rationale for these
procedures.

For each task domain, the number of tasks selected was based on a judgment of the number of
performance measures required to give an adequate sample and conformed to a total testing time of
no more than 8 hours for all measures. This time limit was the maximum time feasible to keep an
airman from his/her unit. Within this timeframe, individuals were assessed on specialty-wide,
duty-core, and incumbent-unique tasks.


Phase I. Selection of Specialty-Wide Tasks

Step 1. Select all tasks which are included in the Plan of Instruction (POI) for initial AFS
training or, if not in the POI, are performed by at least 30% of the first-term incumbents, 1-48
months total active Federal military service (TAFMS).

This reduced the task pool to those tasks deemed important enough for training or those
performed by a substantial number of first-term airmen across the AFS. (The 30% cutoff value may
be varied by specialty according to the number of tasks performed by first-termers in that
specialty.)
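
The Step 1 screen can be sketched in Python. This is a minimal illustration rather than the
project's software: the Task record, its field names, and the sample values are hypothetical;
only the rule itself (in the POI, or performed by at least 30% of first-term incumbents) comes
from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical record; field names are illustrative, not from the survey files.
    task_id: str
    in_poi: bool            # task appears in the Plan of Instruction
    pct_first_term: float   # percent of first-term incumbents performing the task


def screen_specialty_wide(tasks, cutoff_pct=30.0):
    """Keep tasks that are in the POI or meet the percent-performing cutoff."""
    return [t for t in tasks if t.in_poi or t.pct_first_term >= cutoff_pct]


# Made-up example data, used only to exercise the rule.
sample_tasks = [
    Task("A", in_poi=True, pct_first_term=12.0),   # kept: in the POI
    Task("B", in_poi=False, pct_first_term=45.0),  # kept: above the cutoff
    Task("C", in_poi=False, pct_first_term=8.0),   # dropped
]
pool = screen_specialty_wide(sample_tasks)
```

The cutoff is a parameter because the text allows it to vary by specialty.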

Step 2. Cluster tasks selected in Step 1 based on one or more of the following: (a) factor
or cluster analysis of co-performance data, (b) Specialty Knowledge Test outline, (c) Specialty
Training Standard outline, or (d) occupational survey duty outline. Each of these is a means of
organizing the pool of tasks into performance/knowledge areas based on occupational information.
All are similar in results; thus, the selection of the clustering strategy should be based on a
judgment as to which is cost effective and well suited to the development of performance measures
for a specific AFS.

Step 3. Weight each task cluster to reflect its relative importance to the overall
performance of first-term airmen within the specialty. Possible sources for weighting clusters
include the following: (a) Specialty Knowledge Test outline weights, (b) Specialty Training
Standard proficiency level requirements, (c) SME judgments of relative importance, and (d)
weights derived from training emphasis ratings (i.e., SME judgments of the extent to which
training is required for tasks) and percent time spent ratings (i.e., incumbent judgments of the
relative time spent performing tasks). These latter weights could be derived as the product of
the mean training emphasis rating and the cumulative time spent performing tasks in a cluster.
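
Weighting option (d) can be illustrated with a short Python sketch. The function name and the
ratings below are hypothetical; the only element taken from the text is the rule that a cluster
weight is the mean training emphasis rating multiplied by the cumulative percent time spent on
the cluster's tasks.

```python
from statistics import mean


def cluster_weight(training_emphasis, pct_time_spent):
    """Weight = mean training emphasis x cumulative percent time spent.

    Both arguments hold one value per task in the cluster, in the same order.
    """
    if not training_emphasis or len(training_emphasis) != len(pct_time_spent):
        raise ValueError("need one rating and one time value per task")
    return mean(training_emphasis) * sum(pct_time_spent)


# Made-up ratings for a three-task cluster.
example_weight = cluster_weight([4.0, 6.0, 5.0], [2.0, 3.0, 5.0])
```

Because only relative size matters in the next step, the raw products need no rescaling here.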

Step 4. Determine the number of tasks to be selected from each cluster to reflect the
assigned weights. Total the cluster weights, and divide each cluster weight by the total to get
its relative percentage of importance. Multiply each cluster percentage by the total possible
tasks to determine the number of tasks to be selected.
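
The allocation in Step 4 can be sketched as follows. The text does not say how fractional task
counts were rounded, so the largest-remainder rule used here is an assumption, as are the
function name and the shortened cluster labels.

```python
from fractions import Fraction


def allocate_tasks(cluster_weights, total_tasks):
    """Split total_tasks across clusters in proportion to their weights.

    Rounding uses the largest-remainder rule (an assumption, not stated in
    the source): floor every quota, then hand leftover slots to the clusters
    with the largest fractional parts, ties going to the earlier cluster.
    """
    names = list(cluster_weights)
    total_weight = sum(cluster_weights.values())
    if total_weight <= 0:
        raise ValueError("cluster weights must sum to a positive value")
    # Exact fractions avoid floating-point surprises when comparing remainders.
    quotas = {n: Fraction(cluster_weights[n]) * total_tasks / total_weight
              for n in names}
    counts = {n: int(quotas[n]) for n in names}  # floor of each quota
    leftover = total_tasks - sum(counts.values())
    by_remainder = sorted(names, key=lambda n: quotas[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    return counts


# Cluster weights reported later in this paper for AFS 426X2 (labels shortened).
jet_engine_weights = {
    "Forms, Records and Reports": 10,
    "Quality Control": 5,
    "Flightline Engine Maintenance": 10,
    "In-Shop Engine Maintenance": 20,
    "Test Cell": 5,
    "General Engine Maintenance": 50,
}
allocation = allocate_tasks(jet_engine_weights, 18)
```

Under this rounding assumption, 18 tasks split as 2, 1, 2, 3, 1, and 9 across the six clusters
in the order listed. The paper does not report the per-cluster counts actually used.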

Step 5. Within each cluster, randomly select the number of tasks determined in Step 4 to
reflect a range of learning/task difficulty by: (a) ranking the tasks on task difficulty, (b)
dividing the ranked list into quartiles, (c) selecting 40% of the tasks from the fourth quartile,
(d) selecting 30% from the third quartile, (e) selecting 20% from the second quartile, (f)
selecting 10% from the first quartile, and (g) repeating for each cluster. (It is important to
sample tasks covering a range of difficulty so incumbent performance assessment will reflect the
rank-ordering of people of varying levels of job competence. The sampling is more heavily
weighted on the more difficult tasks to reflect the aptitude requirements of the specialty and
where more performance variation should occur.)
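
The difficulty-stratified draw in Step 5 can be sketched in Python for a single cluster. The
function names, the (task_id, difficulty) tuple layout, and the rounding of quartile counts are
assumptions made for illustration; the 10/20/30/40 percent shares, with the fourth quartile
taken to be the most difficult, follow the text.

```python
import random

# Percent of the cluster's selections drawn from each difficulty quartile,
# ordered from the first (easiest) to the fourth (hardest) quartile.
QUARTILE_SHARES = (10, 20, 30, 40)


def split_into_quartiles(tasks_with_difficulty):
    """Rank (task_id, difficulty) pairs by difficulty and cut into four groups."""
    ranked = sorted(tasks_with_difficulty, key=lambda pair: pair[1])
    n = len(ranked)
    return [ranked[i * n // 4:(i + 1) * n // 4] for i in range(4)]


def sample_by_difficulty(tasks_with_difficulty, n_select, seed=None):
    """Randomly draw tasks from each quartile according to QUARTILE_SHARES.

    Counts are rounded per quartile, so the total equals n_select exactly
    only when the shares divide evenly (for example, n_select = 10).
    """
    rng = random.Random(seed)
    selected = []
    for share, quartile in zip(QUARTILE_SHARES,
                               split_into_quartiles(tasks_with_difficulty)):
        want = round(share * n_select / 100)
        selected.extend(rng.sample(quartile, min(want, len(quartile))))
    return selected


# Made-up cluster of 40 tasks whose difficulty equals the task number.
cluster = [(f"T{i:02d}", float(i)) for i in range(1, 41)]
picked = sample_by_difficulty(cluster, 10, seed=0)
```

Fixing the seed makes a draw reproducible, which supports the replicability the sampling plan
is meant to provide.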

Step 6. Review the tasks identified in Step 5 to determine if they can be measured by either
the hands-on or interview components of Walk-Through Performance Testing (WTPT). Return any task
found to be unsuitable for WTPT to the task pool. If possible, randomly select a replacement
task from the same task difficulty quartile. Document why the original task was unsuitable.
(The ability to assess performance through observation/interview procedures (WTPT) is a
prerequisite to final task selection because performance measures obtained via these
high-fidelity techniques will be the benchmarks against which surrogate measures are compared.)
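
The replacement rule in Step 6 can be sketched as below. All names and the example reason are
hypothetical, and the choice to bar previously rejected tasks from being redrawn is an
assumption; the text only requires a random replacement from the same difficulty quartile and a
documented reason.

```python
import random


def replace_unsuitable(selected, quartile_pool, unsuitable, reason, log, rng=None):
    """Swap one unsuitable task for a random task from the same quartile.

    selected      -- task ids currently in the sample
    quartile_pool -- all task ids in the unsuitable task's difficulty quartile
    log           -- list that accumulates (task_id, reason) records
    Returns the updated sample; if no candidate remains, the sample shrinks.
    """
    rng = rng or random.Random()
    log.append((unsuitable, reason))  # document why the task was dropped
    rejected = {task_id for task_id, _ in log}
    remaining = [t for t in selected if t != unsuitable]
    # Assumption: tasks already selected or already rejected are not redrawn.
    candidates = [t for t in quartile_pool
                  if t not in selected and t not in rejected]
    if candidates:
        remaining.append(rng.choice(candidates))
    return remaining


# Made-up example: task "B" cannot be observed and "C" is the only alternative.
rejection_log = []
revised = replace_unsuitable(["A", "B"], ["B", "C"], "B",
                             "too costly to observe hands-on",
                             rejection_log, random.Random(0))
```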


Phase II. Selection of Duty-Core Tasks


Because the performance domain for a duty area (e.g., a specific engine type or work center)
is much less broad than for the entire specialty, fewer tasks are needed for an adequate sample.
Also, because tasks selected for one duty area may be performed in another area, tasks can be
selected for more than one duty area. However, since tasks selected in Phase I for
specialty-wide measures will be used to assess all incumbents, they should not be used to develop
duty-core measures. The following steps apply for each duty area.

Step 1. Select all tasks performed by at least 40% of the first-term airmen identified as
performing the duty in question (as noted earlier, this cutoff may vary according to number of
tasks performed by first-termers) and not utilized in Phase I. (Within each duty area, a higher
proportion of incumbents performing tasks can be used as a basis for identifying tasks to be
assessed because the performance domain is more narrowly defined than across the entire
specialty.)


Step 2. From the tasks identified in Step 1, select the total number of tasks to reflect a
range of learning/task difficulty by repeating Phase I, Step 5.

Step 3. Repeat Phase I, Step 6.


Phase III. Selection of Incumbent-Unique Tasks

Because the performance domain for each job type is less broad than for the entire specialty,
fewer tasks are needed to provide an adequate sample. Also, because the tasks selected for a job
type may be applicable to another job type, tasks selected for one may be used for another.
However, tasks selected in Phases I or II should not be used to develop incumbent-unique
measures. The following steps apply for each job type.


Step 1. Select all tasks performed by 50% or more of the incumbents in the incumbent-unique
group and not utilized in Phases I or II. (Again, as the job domain becomes more specific, it is
possible to select tasks performed by a higher proportion of incumbents. In addition, the cutoff
may vary by number of tasks performed by first-termers.)

Step 2. From the tasks identified in Step 1, select the total number of tasks to reflect a
range of learning/task difficulty by repeating Phase I, Step 5.

Step 3. Repeat Phase I, Step 6.


Review and Approval of Task Sample

A description of the application of these task sampling procedures and the tasks selected for
each AFS was reviewed by appropriate AFS functional managers and technical training
representatives, who provided feedback concerning the adequacy of the tasks selected. Reviewers
examined the task sample to ensure that work performed by first-term airmen and critical wartime
requirements were well represented. Approval of the task sample by these policy makers should
increase the acceptance and utilization of the resulting job performance measures.


APPLICATION TO THE JET ENGINE MECHANIC SPECIALTY (AFS 426X2)

Before the sampling plan could be applied to the Jet Engine Mechanic Specialty, duty areas
and incumbent-unique job types were identified. How these were chosen is described before
illustrating employment of the task selection plan.


Defining the Job Domain


Duty Areas Selected. Duty areas were selected based on engine type maintained. An
inspection of the occupational survey data revealed that 20%, 18%, and 17% of AFS 426X2
first-term airmen performed maintenance tasks on J-57, J-79, and TF-33 engines, respectively.
Since these percentages were the highest among the nine engine types maintained by AFS 426X2
personnel, these three engines were selected as being representative of equipment maintained by
first-term jet engine mechanics.

Job Types Selected. The occupational survey data also revealed that the vast majority of
first-term jet engine mechanics perform similar jobs (i.e., most airmen maintain similar engine
accessory systems). The largest percentage of first-term incumbents in each major command spend
the majority of their time performing general engine maintenance tasks in shop or on the
flightline. As a result, these two functional areas were identified as being representative of
AFS 426X2.


Phase I. Selecting Specialty-Core Tasks


Task Clustering. Tasks were clustered by occupational survey duty area because this grouping
adequately reflected the work done in the specialty and was cost effective. Weights were
computed based on the product of the mean training emphasis rating and the cumulative time spent
performing tasks in a cluster. The following six task clusters received the weights indicated.

Cluster                                                   Weight

Preparing and Maintaining Forms, Records and Reports        10
Performing Quality Control Functions                         5
Performing Flightline Engine Maintenance Functions          10
Performing In-Shop Engine Maintenance Functions             20
Performing Test Cell Functions                               5
Performing General Engine Maintenance Functions             50

Task Selection and Review. The remaining Phase I steps were followed, and 18 tasks were
selected to reflect the weights outlined above. Ten tasks were selected for each engine type in
Phase II and ten for each job type in Phase III. Selected tasks were reviewed by SMEs and
unsuitable tasks deleted. New tasks were selected and reviewed and the task list finalized,
giving a representative set of tasks upon which to develop performance measures. The main
justifications for the task exclusions were:

Task not common to all engines (Phase I)

Task not common to all functional areas (Phases I and II)

Task unclear, too broad, complex, or trivial

Task not representative of functional area (Phase III)

Task performed differently on different aircraft (Phase I)

Overlapping or similar task

Task performed differently depending on how engine is shipped (air, rail, or truck) and its
destination (depot, deployment)

Task performed differently depending on organizational unit (Examples: Some supervisors do
not allow test cell personnel to transport engines. Strategic Air Command flightline
personnel do not make entries on oil analysis request forms (DD Form 2026), but Military
Airlift Command flightline personnel do make such entries.)

Equipment being changed within the year

In summary, a strategy for task selection was developed to sample tasks representative of the
job content. This strategy was applied to the Jet Engine Mechanic Specialty (426X2), and the
selected tasks were used to develop performance measures and standards.


CONCLUSION 

The Air Force test content selection strategy combines expert judgment and stratified random
sampling to select representative tasks. The input of SMEs, in the form of task factor data and
review, helps focus the task selection process on areas of importance. This is significant given
the limited number of tasks which can be covered in a hands-on testing situation due to time and
cost constraints. Thus, it is critical that only those tasks deemed most representative be
considered for inclusion in the test. The random sampling aspect of the sampling procedure helps
assure the generalizability of the sampled tasks. Thus, expert judgment and random sampling
procedures are incorporated in the sampling strategy to develop a content-valid test consisting
of an optimal set of tasks for use in performance testing.

Several research questions are raised in considering the utility of a content selection
plan. Questions that merit attention concern the choice and definition of
parameters, the level of detail required, the degree of judgment involved, and the
appropriateness and generalizability of one selection strategy across many different content
areas. Research addressing these questions will have both practical and theoretical value. Such
research is currently being planned by the Air Force.


II. WORK SAMPLE TESTING IN THE AIR FORCE JOB
PERFORMANCE MEASUREMENT PROJECT


Jerry W. Hedge
M. Suzanne Lipscomb
Mark S. Teachout

Air Force Human Resources Laboratory

The Air Force Human Resources Laboratory (AFHRL) is conducting a large-scale effort to
develop a measurement technology for systematically obtaining job performance data. The overall
program of research calls for the development of criterion measures that allow for the collection
of valid, accurate, and reliable job performance information. The chief aim of criterion
development research is to identify the measurement technique that most faithfully represents
relevant job behaviors. One approach that is considered to have high fidelity to the work
environment is work sample testing. The focus of this paper is work sample testing in the Air
Force Job Performance Measurement Project. The paper describes the work sample philosophy and
developmental process used by the Air Force, and details test administrator qualifications,
training, and data collection. Relevant data associated with hands-on and interview work samples
will be presented, including a comparison of the two approaches.


AIR FORCE WORK SAMPLE TESTING

As noted by Wilson (1962), over the years the primary use of the work sample has been for
personnel selection. However, this approach can be a valuable aid in the measurement of job
proficiency. Typically, work sample tests require an individual to perform a task or set of
tasks relevant to that person's job and selected from the range of tasks performed by the job
incumbent. The value of the work sample methodology lies in the fidelity with which the selected
set of tasks allows measurement of an incumbent's job proficiency. This can also be a weakness of
the technique. Unfortunately, work sample procedures normally identify critical tasks, discard
those not practically measurable, and then simply allow the remainder to become the selected set
of tasks to be measured. AFHRL's approach to work sample testing is an attempt to overcome this
criterion deficiency problem.


Walk-Through Performance Testing

For the Air Force, hands-on testing is a particular problem because of the complexity and
expense involved in performing many tasks. For example, some critical tasks cannot be measured
by hands-on testing because these tasks tend to take too long to complete, require replacement of
expensive parts, or risk possible damage to components. AFHRL has developed a new methodology to
deal with these problems. This new approach, Walk-Through Performance Testing (WTPT), has as its
foundation the work sample philosophy but attempts to expand the measurement of critical tasks to
include those tasks not measured by hands-on testing, through the addition of an interview
testing component (Hedge, 1984).

The hands-on component of the WTPT resembles a traditional hands-on work sample test designed
to measure proficiency on a critical task. For example, one hands-on task may require an
incumbent to install a starter on a jet engine. On the first page of the test administrator's
manual, information is provided to the test administrator concerning: required testing time;
tools, technical orders, and job guides; pertinent background information and required engine
configuration; and test administrator instructions. While the starter is being installed, the
test administrator uses a checklist to indicate whether steps (e.g., lubricate the spline, index
the position of the starter, and install the locking device) are performed correctly. Finally, a
5-point rating scale allows the test administrator to record an overall rating of proficiency on
the task.

Many tasks are either too time-consuming, too costly, or too dangerous to measure by hands-on
testing. Interview testing attempts to expand the content domain by measuring tasks that cannot
be measured practically with the hands-on method. Interview testing requires the incumbent to
explain the step-by-step procedures necessary for successful completion of the task. This allows
the test administrator to assess an incumbent's proficiency-based strengths and weaknesses
related to the performance of that task. For example, an interview item may test an incumbent's
ability to determine the source of high oil consumption. Once again, on the first page of the
administrator's manual, pertinent information is provided to the test administrator. While the
incumbent is explaining how to perform the task, the test administrator uses a checklist to
indicate whether the steps necessary for successful performance are correctly described. In
addition, a 5-point overall proficiency rating is recorded by the administrator.

The interview testing is conducted at the work site in a "show-and-tell" fashion that allows
the incumbent to "visually and verbally" describe how a step is to be accomplished (e.g., "that
bolt is to be turned five revolutions" or "that component is to be lubricated prior to being
assembled"). Thus, information on additional tasks can be collected along with hands-on
information to provide a more thorough coverage of the content domain and a more accurate picture
of an individual's job proficiency.


Task Sampling and Item Development

An extensive task sampling plan was developed for each job using information obtained from
the Air Force's Occupational Survey Program (Lipscomb, 1984). This program maintains job content
information for over 200 of the 250 specialties in the Air Force. Surveys for each of these 200
specialties are administered approximately every 4 years to keep the job content information
current. Available information in each job content domain includes the tasks performed, the
relative amount of time spent performing these tasks, and the emphasis to be given to the tasks in
training. This information was used to select tasks for developing WTPT components.

Visits were made to several Air Force bases in order to interview subject-matter experts
(SMEs) about these tasks (Alba, Dickinson, & Lipscomb, 1985). Using the appropriate technical
order, the SMEs were asked to describe the procedural steps involved in performing each task,
whether procedures for performance on a task might differ by location, and whether the
development of a hands-on test for a task was feasible (e.g., in terms of time, and safety of
equipment and personnel). This information was used to delete those tasks not suitable for
testing. Hands-on and interview tests were written for each of the remaining tasks. These tests
were reviewed by SMEs, and based on their input, the tests were refined. Finally, the tests were
field tested at several Air Force bases. To date, work sample tests have been constructed in
four specialties: Jet Engine Mechanic (AFS 426X2), Information Systems Radio Operator (AFS 492X1),
Air Traffic Control Operator (AFS 272X0), and Avionic Communications Specialist (AFS 328X0).


Test  Administrator  Training 

Experts in each of the career fields under study were selected to serve as test
administrators. These personnel consisted of active-duty senior-level Noncommissioned Officers
(NCOs) provided by their Major Commands, or recently separated/retired former NCOs hired by the
contractor. Because the test administrators were already experienced at performing their
particular jobs, training focused on the logistics of the effort and on improving their
observation, recording, and interviewing skills. Logistics training focused on base arrangements,
testing requirements, and potential problems. Test administrators were given an administrator's
manual for use as a procedural guide in the field.

To emphasize observation and scoring skills, videotapes of job incumbents (actors) performing
or describing how they would perform tasks allowed the test administrators to practice observing
and recording performance for the hands-on and interview tests. Videotapes were constructed for
multiple tasks for each AFS. Scenarios were generated by consulting SMEs as to where and how
legitimate performance errors could be made within each task. This information was used to
develop correct versions in which job incumbents performed task steps correctly, and incorrect
versions containing realistic performance errors. One correct and several incorrect versions of
task performance were videotaped.

Interview skills training utilized videotapes and role playing. This training emphasized
correct and incorrect procedures to follow in gathering data through the interview, modeling of
correct interviewer behaviors, and face-to-face interaction between the test administrator and
job incumbent.


DATA COLLECTION AND ANALYSIS

Data collection was conducted in a standardized fashion for all four AFSs with work sample
tests. Because data collection and analysis have been completed only on jet engine mechanics,
collection and analysis details will be specific to that career field.

Data  Collection 

Three 3-man teams of test administrators tested first-term (13-48 months total active Federal
military service) jet engine mechanics using the Air Force work sample methodology. Over an
8-hour period, 10 hands-on and 10 interview items were administered. Performance on 5 tasks was
measured using both hands-on and interview items. Each of these 20 work sample items carried a
point value of 10.

The instrument pretest was conducted by the nine test administrators at three Air Force bases in
the continental United States (CONUS). Forty-two job incumbents were tested using the WTPT
approach. Full-scale data collection occurred at 13 Air Force bases in the CONUS. The three
teams of test administrators collected data from 255 incumbents who were randomly selected from
the population of first-term mechanics at each base.


Data Analysis


Jet Engine Mechanic data analysis included pretest and full-scale data collection in order to
address the following issues concerning work sample tests: (a) interrater reliability of test
administrators; (b) test/retest reliability; (c) discriminability; (d) comparison of hands-on and
interview methods; and (e) comparison of work samples with other relevant variables. For
purposes of analysis, summary values for each instrument were derived by summing scores across
work sample items. In this way, a hands-on summary score consisted of 100 points (ten 10-point
items) and an interview summary score consisted of 100 points (ten 10-point items).

Reliability, Accuracy, and Descriptive Statistics


Interrater Reliability. Data from the three teams of test administrators were analyzed at
three separate points in time. During the training workshop (mentioned earlier), videotaped task
performances were shown to and scored by all test administrators. This allowed both an analysis
of interrater consistency and an assessment of administrator accuracy (in comparison to the
videotaped target scores). One month after the first training workshop, pretest data were
collected from 14 job incumbents per administrator team. For each team, nine incumbents were
assessed by a single administrator. The remaining incumbents were scored by the 3-member team,
allowing an evaluation of interrater reliability. Two and one-half months after the first
workshop, a retraining workshop was held, yielding data collection and analyses comparable to the
first workshop.

Pairwise percent agreement indices were computed across the three teams of test
administrators. The arithmetic averages of these indices for each team are reported in Table 1.
The indices suggest that a high level of interrater reliability was obtained for the three teams
at all points in time. In addition, interrater agreement tended to improve over time.
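A minimal sketch of how a team-level pairwise percent agreement index could be computed is given here. The paper does not say whether agreement was tallied per checklist step or per overall rating; the sketch assumes step-level checklist marks, and the administrator labels and marks are invented for illustration.

```python
from itertools import combinations


def percent_agreement(a, b):
    """Percentage of checklist steps on which two raters gave the same mark."""
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def mean_pairwise_agreement(ratings_by_rater):
    """Arithmetic average of percent agreement over every pair of raters."""
    pairs = list(combinations(ratings_by_rater.values(), 2))
    return sum(percent_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 3-member team scoring one 8-step task (1 = correct, 0 = incorrect).
team = {
    "admin_1": [1, 1, 0, 1, 1, 1, 0, 1],
    "admin_2": [1, 1, 0, 1, 0, 1, 0, 1],
    "admin_3": [1, 1, 1, 1, 1, 1, 0, 1],
}
team_agreement = mean_pairwise_agreement(team)
```

The same `percent_agreement` helper could compare one administrator's marks with a videotape's target scores, which is the accuracy comparison described for the workshops.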

Table 1. Interrater Agreement for the Workshops
and the Pretest

Team    Workshop 1    Pretest    Workshop 2


Interrater Accuracy. Percent agreement accuracy indices were calculated for each
administrator on common tasks performed in each workshop. These indices were computed between
ratings and target scores. The averages of the indices for each team across tasks are reported
in Table 2. Accuracy was quite high for all teams at both workshops.

Table 2. Percent Agreement Between Test Administrators
and Target Scores for the Workshops

Team    Workshop 1    Workshop 2


Test-Retest Reliability. Three pretest bases were revisited during full-scale data
collection (approximately 2 months after pretest), and test administrators collected work sample
data from all available job incumbents who had participated in the pretest. Test-retest
reliability estimates were computed for both hands-on and interview tests. For all hands-on
tests, incumbents were rated consistently 78% of the time. Ratings with the interview were not as
consistent, averaging a test-retest correlation of .56. This lower value can be attributed to
the greater subjectivity of the interview procedure. Thus, these procedures take longer to
stabilize than the more objective hands-on procedures.

Work Sample Descriptive Statistics. Ten hands-on work sample items comprised the hands-on
work sample test. Thus, out of 100 points, the 255 jet engine mechanics scored an average of
73.01, with a standard deviation of 10.53. For the 10 interview work sample items that comprise
the interview work sample test, incumbents scored 62.35 on the average, with a standard deviation
of 12.53. These average scores suggest a test of moderate difficulty. A test with large numbers
of incumbents scoring quite high or low would reduce the ability to discriminate between
incumbents' performance. In fact, the variability in test scores is quite good, with the
interview slightly superior to the hands-on test. Figures 1 and 2 show the distributions of
scores for both the hands-on and interview work samples.


Figure 1. Hands-On Work Sample Test Score Distribution (frequency by hands-on test score).

Figure 2. Interview Work Sample Test Score Distribution (by interview test score).

For a closer look at work sample test scores, Table 3 provides a task-by-task breakdown of
means and standard deviations for both hands-on and interview items.

Table 3. Means and Standard Deviations for All Hands-On
and Interview Work Sample Items

                   Hands-on           Interview
Item number      Mean     SD        Mean     SD

Items constructed for both methods:
134              7.22    1.78       6.15    1.68
347              7.02    1.62       6.67    1.88
353              7.63    1.57       6.51    2.13
360              7.16    1.85       5.92    2.07
373              8.48    1.73       6.25    1.70

Items constructed for one method only (the hands-on or interview column
placement of these rows is not recoverable from the source):
Item number      Mean     SD
363              8.63    1.55
301              6.22    1.91
302              7.96    1.63
349              5.55    2.22
385              6.42    3.17
346              7.04    2.27
171              6.82    2.58
351              6.70    2.20
359              6.38    2.15
396              7.00    2.01
387              8.20    1.60
319              4.71    1.98
239              7.26    2.66
247              7.95    2.16
238              6.54    2.12
208              7.03    1.80
325              5.24    1.84
(illegible)      7.00    2.44


Comparison  of  Hands-On  and  Interview  Work  Samples 

Correlation Between Hands-On and Interview Tests. To assess how similarly the hands-on and
interview tests rank order job incumbents, a bivariate correlational analysis was performed on
summary scores from the 10 hands-on and 10 interview items. This analysis yielded a correlation
of .57 between the hands-on and interview tests. As shown in Table 3, five hands-on and five
interview items were also constructed for identical tasks. These five items yielded a
correlation of .45 between the hands-on and interview work sample methods.
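The bivariate correlation described above is the ordinary Pearson product-moment coefficient computed on paired summary scores. The following is a minimal sketch; the six pairs of scores are invented for illustration and are not data from the study.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical summary scores (0-100) for six incumbents.
hands_on = [76, 68, 81, 59, 90, 72]
interview = [61, 66, 70, 50, 74, 58]

r = pearson_r(hands_on, interview)
```

Applying the same function to sums over only the five common tasks would give the analogue of the second correlation reported above.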

Correlation of Performance-Relevant Variables with Work Samples. In addition to assessing
the correlation between hands-on and interview tests, an additional step was taken to compare
hands-on and interview measures. This entailed examination of each work sample's relationship to
other relevant variables (e.g., predictors, experience indices). A potential surrogate should
not only be meaningfully related to the hands-on measure, but should also demonstrate patterns
of relationships with other variables similar to those of the hands-on measure.

Correlations were calculated between the hands-on and interview work samples and 13 relevant
variables as follows: six experience variables, collected via questionnaire or gathered from
personnel files; two motivation variables, consisting of job satisfaction (8 items) and
supervisory support (2 items); final technical training school grade; and scores on the four
aptitude composites from the Armed Services Vocational Aptitude Battery (mechanical,
administrative, general, and electronics).

Table 4. Correlations Between Work Sample Tests and
Performance-Relevant Variables

Performance-relevant variables    Work samples

*Significant differences between hands-on and interview items are
at the .05 level.

As shown in Table 4, a test of significance between hands-on and interview correlations
produced only 2 significant differences across the 13 performance-relevant variables. This
suggests similar patterns of relationships between the hands-on and interview work sample tests
across a set of common variables.
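The paper does not name the significance test it used. As an illustration only, one common choice for comparing two correlations that share a variable and come from the same sample is Williams's t test, sketched here. The subscripts are an assumption of this sketch: 1 is the performance-relevant variable, 2 is the hands-on score, and 3 is the interview score. The values r23 = .57 and n = 255 are taken from the text; r12 and r13 are invented, and 1.97 is an approximate two-tailed .05 critical value for 252 degrees of freedom.

```python
from math import sqrt


def williams_t(r12, r13, r23, n):
    """Williams's t for H0: rho12 == rho13 in one sample; returns (t, df)."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    # Determinant of the 3 x 3 correlation matrix.
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    r_bar = (r12 + r13) / 2
    numerator = (n - 1) * (1 + r23)
    denominator = 2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r23) ** 3
    t = (r12 - r13) * sqrt(numerator / denominator)
    return t, n - 3


# Hypothetical correlations of one variable with hands-on (.30) and interview (.10).
t, df = williams_t(r12=0.30, r13=0.10, r23=0.57, n=255)
significant = abs(t) > 1.97  # approximate critical value, assumed for illustration
```

Because the two work sample scores are themselves correlated, a dependent-samples test of this kind is more appropriate than treating the two correlations as if they came from independent groups.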


CONCLUSION  AND  IMPLICATIONS 

The major purpose of this paper was to explore the work sample testing philosophy and
methodology being used by the Air Force in the Joint-Service Job Performance Measurement
Project. In the development of objective measuring devices, a main requirement is freedom from
biasing errors that arise from the performance situation and the method of measurement. Biasing
factors in the performance situation (e.g., tasks performed, availability of tools and equipment)
and the method of measurement (e.g., objectivity of work sample test administrators) are often
found to systematically influence the quality of performance and performance measurement.
Consequently, the developmental process, test administrator training, and data collection were
discussed in some detail, and pertinent analyses presented to clarify the characteristics of the
work sample tests. In addition, it was desirable to draw some conclusions about criterion
equivalence between the hands-on and interview work samples; thus, a comparison of the two
approaches was presented.

In general, both the hands-on and interview versions of Walk-Through Performance Testing held
up well under close scrutiny. After thorough training, test administrators were able to rate
reliably and accurately, over time, using both the hands-on and interview instruments. Both
tests also showed moderate mean test scores and sufficient variability to suggest good
discrimination among incumbents. Finally, two analyses examined hands-on and interview
equivalence. A correlation of .57 provides additional strength to the interview testing approach
as a valid work sample methodology. In terms of relationships to a set of relevant variables,
the interview showed a pattern similar to that of the hands-on test. Taken together, this
information describes a solid program of work sample testing. Also, a first demonstration of
interview testing was successful, suggesting this approach as a viable new work sample
technology. Additional analyses are underway to examine the types of task characteristics that
make some tasks more amenable to interview work sample testing. As data become available on
three additional AFSs, a more precise understanding of hands-on and interview tests will be
possible.


III. PREDICTIVE EFFICIENCY OF THE ASVAB FOR THE AIR FORCE'S
JOB PERFORMANCE MEASUREMENT SYSTEM


Terry  L.  Dickinson 
Old  Dominion  University 

Jerry W. Hedge
Lt Col Rodger D. Ballentine

Air  Force  Human  Resources  Laboratory 

The Air Force Human Resources Laboratory (AFHRL) is currently developing a measurement system
for obtaining job performance data. This Job Performance Measurement System (JPMS) will serve
three interrelated purposes. First, the JPMS will provide operational managers of the Air
Force's human resources program with criteria to evaluate program effectiveness. Second, the
JPMS will provide Air Force research scientists with performance measures to use in research and
development (R&D) projects. Finally, the JPMS will provide measures for assessing how well the
Armed Services Vocational Aptitude Battery (ASVAB) predicts on-the-job performance. In this
paper, we examine the underlying structure of the JPMS, and the predictive efficiency of the
ASVAB for the JPMS measures.


BACKGROUND 

Job performance is a complex concept. It consists of several dimensions that are predicted
by many human attributes. The complexity of job performance has led researchers to advocate the
use of multiple measures of job performance that are homogeneous in content and relatively
independent of each other (Dunnette, 1963; Guion, 1976). This construct-oriented approach
clarifies the conception of job performance and thereby enhances the understanding of predictors
of job performance.

The Air Force's JPMS emphasizes a hierarchical classification of job performance, as well as
multiple methods and sources for measurement (Kavanagh, Borman, Hedge, & Gould, 1986). The
broadest classification defines the components of job performance to reflect either (a) technical
or (b) interpersonal aspects of work.

At the next level of the hierarchy, job performance components are classified by dimensions.
Each dimension still reflects technical or interpersonal performance; however, the concept of
performance is enriched by subclassifying technical and interpersonal performance into
dimensions.

Many subclassifications of performance components into dimensions are possible, but their
content usually emphasizes task, behavior, or trait information. The Air Force uses two
subclasses: one emphasizes task-oriented and the other trait-oriented information.

The task-oriented dimensions are also subclassified. Each dimension is broken down into a
set of interrelated tasks that reflect the content of the dimension. Furthermore, the tasks are
broken down into task steps. These steps reflect the elemental or "go" versus "no-go" aspects of
task performance.

The JPMS also emphasizes the use of multiple methods and sources for job performance
information. Methods include testing, interviewing, or rating. Sources are the individuals who
provide the information, and they include peers, incumbents, supervisors, and experts. The Air
Force uses testing and interviewing procedures to collect information using expert test
administrators, as well as rating procedures to obtain information from incumbents, supervisors,
and peers.


PURPOSE 

The purpose of the paper was (a) to assess the structure of JPMS measures and, if necessary,
revise the conception, in order (b) to evaluate the ASVAB's predictive efficiency in terms of
these measures.


DATA  COLLECTION 


Participants 


Performance data were collected on 255 first-term jet engine mechanics at 13 geographical
locations in the United States early in 1985.

Job  Performance  Measures 

Job performance information was obtained with testing, interviewing, and rating procedures.
The interviewing and testing procedures are collectively referred to as Walk-Through Performance
Testing (WTPT). The testing component is a traditional hands-on performance test that is
administered by trained personnel. For example, a hands-on test for a jet engine mechanic
requires the incumbent to install a starter on a jet engine. As the starter is installed, the
test administrator uses a checklist to indicate whether each task step is performed correctly.
The interviewing component is also administered systematically by trained personnel. It requires
the incumbent to explain the step-by-step procedures that must be employed for successful task
performance. The distinction between the hands-on testing and interviewing is clear. Hands-on
testing emphasizes "can do" the task, while interviewing emphasizes "knows how to do." Hands-on
data were obtained on 10 tasks, and interviewing data on 10 tasks. Five of the 20 tasks were
common to both hands-on testing and interviewing. Separate scores were obtained for the unique
hands-on and interviewing tasks as well as for the total 20 tasks. The WTPT scores provided an
indication of technical proficiency.

Ratings were obtained for all the content levels of job performance from incumbents,
supervisors, and peers. The task rating form provided the most specific rating data. Thirty
tasks were rated with 5-point scales anchored with adjectives at each point.

The task-oriented dimensional rating form required incumbents, supervisors, and peers to rate
technical proficiency on task-oriented dimensions. Potential dimensions were identified through
factor analysis of occupational survey data. In a series of workshops with subject-matter
experts, the dimension definitions and representative tasks were discussed, and 5-point rating
scales were constructed for each dimension. Behavioral descriptions for each of the five points
were developed using the behavioral summary statement approach advocated by Borman (1979). The
dimensions were: (a) completion of forms; (b) remove/replace engine components; (c) inspect
engine; (d) quality control; (e) shop maintenance; (f) preparation for storage and shipping; (g)
flightline maintenance; and (h) troubleshooting. The dimension scores were averaged to obtain an
indication of technical proficiency based on task-oriented dimensions.

The trait-oriented dimensional rating form was developed to be representative of all
specialties in the Air Force. It focuses on traits that distinguish effective performers across
all jobs. The form was constructed by a group of "resource managers" in the Air Force who had
managerial responsibilities for a large number of specialties. They were able in discussion to
compare the performance requirements of several specialties and reach consensus on an
inter-specialty perspective of performance. In addition, the managers developed 5-point rating
scales that were anchored with behavioral summary statements. The dimensions were: (a)
technical knowledge/skill, (b) initiative/effort, (c) knowledge of and adherence to
regulations/orders, (d) integrity, (e) leadership, (f) military appearance, (g) self-development,
and (h) self-control. For the trait-oriented rating form, the technical knowledge/skill
dimension was retained as an indication of technical proficiency, while the remaining dimension
scores were averaged to indicate interpersonal proficiency.
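The scoring rule for the trait-oriented form can be sketched as follows. The dimension names come from the text; the function name, the dictionary layout, and the example ratings are assumptions made for illustration.

```python
from statistics import mean

TECHNICAL = "technical knowledge/skill"
INTERPERSONAL = [
    "initiative/effort",
    "knowledge of and adherence to regulations/orders",
    "integrity",
    "leadership",
    "military appearance",
    "self-development",
    "self-control",
]


def score_trait_form(ratings):
    """Return (technical, interpersonal) scores from one rater's 5-point ratings."""
    missing = [d for d in [TECHNICAL, *INTERPERSONAL] if d not in ratings]
    if missing:
        raise KeyError(f"missing dimensions: {missing}")
    technical = ratings[TECHNICAL]  # retained as a single score
    interpersonal = mean(ratings[d] for d in INTERPERSONAL)  # average of the other seven
    return technical, interpersonal


# Hypothetical ratings of one incumbent by one rater.
example = {
    TECHNICAL: 4,
    "initiative/effort": 3,
    "knowledge of and adherence to regulations/orders": 4,
    "integrity": 5,
    "leadership": 4,
    "military appearance": 3,
    "self-development": 4,
    "self-control": 5,
}
technical, interpersonal = score_trait_form(example)
```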

The global rating form was developed to measure technical and interpersonal proficiencies
needed for successful performance. The two items were also developed in a workshop setting. The
two types of proficiency were discussed and defined, and the behavioral summary approach was used
to place specific behavioral descriptions on 5-point scales.


Other  Measures 

The ASVAB scores for the 255 participants were obtained from their personnel records. These
scores included values for the 10 ASVAB subtests, the Armed Forces Qualification Test (AFQT), and
the four Air Force composites (Mechanical, Administrative, General, and Electronics). The
Mechanical composite is used to classify personnel to the Jet Engine Mechanic specialty.

Two hundred and three participants were tested with ASVAB Forms 8, 9, or 10 (which were
operational between 1980 and 1984); 48 participants were tested with ASVAB Forms 5, 6, or 7
(operational between 1976 and 1980). One participant did not have the ASVAB form identified in
his records, and three participants did not have ASVAB data in their records. The four
participants with incomplete ASVAB data, and those tested with Forms 5, 6, or 7, were eliminated
from the sample. Forms 8, 9, and 10 are calibrated to a 1980 reference population, and Forms 5, 6,
and 7 are calibrated to a 1945 population. Thus, the two sets of forms should not be compared.
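The sample accounting in this paragraph can be checked with a short tally. The counts are those reported in the text; the category labels are invented for the sketch.

```python
from collections import Counter

# Counts reported in the text, keyed by hypothetical status labels.
participants = Counter({
    "forms_8_9_10": 203,
    "forms_5_6_7": 48,
    "form_not_identified": 1,
    "no_asvab_data": 3,
})

total_tested = sum(participants.values())   # all participants with performance data
retained = participants["forms_8_9_10"]     # only Forms 8, 9, and 10 are kept
excluded = total_tested - retained          # Forms 5-7 plus incomplete records
```

The four categories account for all 255 participants, leaving 203 in the ASVAB analyses after the exclusions described above.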

Training  school  grades  were  also  available  from  the  personnel  records.  Since  training  grades 
have  frequently  been  used  to  describe  the  validity  of  ASVAB  predictors,  they  served  as  a 
comparison  criterion  for  the  JPMS  measures. 

RESULTS 

Structure  of  JPMS  Measures 

The hypothesized structure of the JPMS was evaluated using confirmatory factor analysis
(Joreskog, 1971). In this approach to hypothesis testing, factor structure and factor
intercorrelation matrices are specified to indicate the nature of the latent traits or factors
that underlie the measures.

The hypothesized factor structure specified two general performance factors (i.e., technical
and interpersonal proficiency) to underlie the JPMS measures. In addition, four factors were
hypothesized as methods of measurement. These method factors were defined by the WTPT measures
and the three sources who provided ratings (i.e., self, supervisor, and peer).

Table 5. Factor Structure of the Job Performance
Measurement System

                                    Factors
Variables            TECH     SUPER    SELF     PEER     INPERS

Hands-on             .973     .290     .182     .247      .012
Interview            .563     .184     .179     .182     -.169
T-self               .276     .288     .846     .269     -.017
T-supervisor         .282     .825     .356     .342     -.088
T-peer               .247     .414     .338     .838     -.057
D-self               .241     .340     .872     .295     -.081
D-supervisor         .368     .884     .452     .457     -.101
D-peer               .448     .437     .419     .855     -.135
A-self-TK            .248     .406     .806     .344     -.018
A-self-IP           -.078     .278     .577     .200      .570*
A-supervisor-TK      .299     .832     .398     .428      .029
A-supervisor-IP      .089     .823     .131     .454      .403
A-peer-TK            .352     .474     .292     .768     -.071
A-peer-IP            .079     .308     .091     .682      .433
G-self-TK            .258     .286     .747     .212     -.010
G-self-IP            .034     .232     .486     .209      .411*
G-supervisor-TK      .361     .803     .331     .457     -.049
G-supervisor-IP      .078     .678     .054     .394      .410
G-peer-TK            .352     .498     .353     .697     -.093
G-peer-IP           -.001     .311     .032     .617      .287

*Underlined loadings indicate marker variables for a factor.

Abbreviations for factors are: TECH = proficiency of the technical
aspects of work; SUPER = overall performance from supervisory point of
view; SELF = overall performance from incumbent's point of view; PEER =
overall performance from a peer point of view; and INPERS = overall
performance on interpersonal factors from incumbent, supervisor, and
peer points of view. Abbreviations for variables are: T = task-level
ratings; D = task-oriented dimensional-level ratings; A = Air Force-wide
dimensional-level ratings; G = global-level ratings; TK = technical
knowledge and skill; and IP = interpersonal aspects of work.

The factor intercorrelation matrix was hypothesized to contain correlations of zero between
the performance and method factors. However, the matrix was hypothesized to contain nonzero
correlations among the performance factors and among the method factors.
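
The hypothesized constraint can be written compactly. The notation below is an assumption introduced for illustration and is not taken from the paper: \Sigma is the covariance matrix of the JPMS measures, \Lambda the factor loadings, \Theta the unique variances, and \Phi the factor intercorrelation matrix, partitioned into performance factors (P) and method factors (M).

```latex
\Sigma = \Lambda \Phi \Lambda^{\top} + \Theta,
\qquad
\Phi =
\begin{pmatrix}
\Phi_{PP} & 0 \\
0 & \Phi_{MM}
\end{pmatrix}
```

Under this sketch, the off-diagonal blocks of \Phi are fixed at zero (no correlation between performance and method factors), while the entries within \Phi_{PP} and within \Phi_{MM} are free to be nonzero.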

The hypothesized factor model provided a poor fit to the JPMS measures. Indeed, the maximum
likelihood estimation procedures did not converge to a proper solution.

Next, an exploratory factor analysis procedure was used to discover the underlying structure
of the JPMS measures (Joreskog, 1969). The results of this analysis are shown in Tables 5 and
6. As shown in Table 5, a five-factor solution was obtained that was similar to the hypothesized
structure. The technical and interpersonal performance factors were obtained, as well as three
factors for the three sources of ratings. However, a separate factor for the WTPT measures did
not appear (i.e., hands-on and interview). These measures along with task, task-related
dimension, and global technical ratings from the three sources defined the technical performance
factor.

As shown in Table 6, the factors had low to moderate correlations. The largest correlation
occurred between the supervisor and peer factors (i.e., .474), and it agrees with previous
research (e.g., Klimoski & London, 1974; Lawler, 1967).

Table 6. Correlations Among the Factors of
the Job Performance Measurement System

Factors      TECH     SUPER    SELF     PEER     INPERS

TECH         1.000
SUPER         .268    1.000
SELF          .234     .319    1.000
PEER          .248     .474     .254    1.000
INPERS       -.206     .107    -.001     .112    1.000

Note. Abbreviations are defined in Table 1.

Predictive Efficiency of ASVAB for JPMS

The ability of the ASVAB to predict the JPMS measures was evaluated using correlations that
were and were not corrected for range restriction. As noted by advisory groups to the
Joint-Service Job Performance Measurement Project, the corrected correlations provide a common
basis for interpretation. The base group used for correction was the 1980 Youth Population
(Department of Defense, 1982), and the Pearson-Lawley procedure was used for multivariate
correction on the 10 subtests of the ASVAB (Mifflin & Verna, 1977).
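
The multivariate correction can be sketched in a few lines. This is a minimal illustration of the standard Pearson-Lawley formulas for explicit selection on a set of predictors, assuming NumPy is available; it is not the program used in the study, and the function names and example values are hypothetical.

```python
import numpy as np


def lawley_correct(v_xx, v_xy, v_yy, sigma_xx):
    """Pearson-Lawley correction for explicit selection on the predictors x.

    v_xx, v_xy, v_yy : covariance blocks observed in the restricted sample
    sigma_xx         : covariance of the predictors in the unrestricted
                       reference population
    Returns the estimated unrestricted blocks (sigma_xy, sigma_yy).
    """
    v_xx = np.asarray(v_xx, dtype=float)
    v_xy = np.asarray(v_xy, dtype=float)
    v_yy = np.asarray(v_yy, dtype=float)
    sigma_xx = np.asarray(sigma_xx, dtype=float)

    b = np.linalg.solve(v_xx, v_xy)        # regression weights of y on x
    sigma_xy = sigma_xx @ b
    # Residual covariance of y plus the part explained by unrestricted x.
    sigma_yy = v_yy - v_xy.T @ b + b.T @ sigma_xx @ b
    return sigma_xy, sigma_yy


def to_correlations(sigma_xx, sigma_xy, sigma_yy):
    """Convert corrected covariances to predictor-criterion correlations."""
    sd_x = np.sqrt(np.diag(np.asarray(sigma_xx, dtype=float)))
    sd_y = np.sqrt(np.diag(sigma_yy))
    return sigma_xy / np.outer(sd_x, sd_y)


# Hypothetical two-predictor, one-criterion example (not data from this study).
v_xx = [[0.25, 0.05], [0.05, 0.36]]
v_xy = [[0.10], [0.06]]
v_yy = [[1.00]]
sigma_xx = [[1.0, 0.7], [0.7, 1.0]]

s_xy, s_yy = lawley_correct(v_xx, v_xy, v_yy, sigma_xx)
print(to_correlations(sigma_xx, s_xy, s_yy))
```

In an application like the one described, the predictor blocks would hold the 10 ASVAB subtests, with the reference-population block taken from the 1980 Youth Population and the remaining blocks from the sample of mechanics.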

Table 7 contains subtest intercorrelations for the 1980 Youth Population (upper triangle) and
for the sample of jet engine mechanics (lower triangle). A comparison of the correlations
indicates large discrepancies between some correlations. For example, the correlation between
general science (GS) and arithmetic reasoning (AR) for the Youth Population is .722, while for
the mechanics it is .221. Such discrepancies indicate that the results for corrected
correlations should be interpreted cautiously.

Table 7. Intercorrelations Among the ASVAB Subtests for the
1980 Youth Population (Upper Triangle) and for the
Sample of Jet Engine Mechanics (Lower Triangle)

ASVAB
Subtests   GS      AR      WK      PC      NO      CS      AS      MK      MC      EI

GS        1.000    .722    .802    .690    .525    .452    .637    .695    .696    .760
AR         .221   1.000    .709    .671    .626    .515    .533    .828    .685    .658
WK         .622    .296   1.000    .803    .618    .552    .529    .671    .595    .684
PC         .369    .281    .516   1.000    .608    .561    .423    .637    .522    .572
NO         .063    .450    .145    .256   1.000    .701    .307    .617    .409    .422
CS         .025    .267    .033    .104    .497   1.000    .225    .520    .336    .341
AS         .342    .217    .263    .207   -.019   -.022   1.000    .415    .741    .746
MK         .285    .695    .352    .391    .465    .312    .092   1.000    .601    .585
MC         .326    .457    .338    .226    .134    .139    .422    .416   1.000    .744
EI         .402    .233    .385    .291   -.002    .082    .530    .235    .502   1.000

Note. Abbreviations for the subtests are GS = General science; AR =
Arithmetic reasoning; WK = Word knowledge; PC = Paragraph comprehension; NO
= Numerical operations; CS = Coding speed; AS = Auto and shop information;
MK = Mathematics knowledge; MC = Mechanical comprehension; and EI =
Electronics information. All sample correlations greater in absolute size
than .138 are significant at p < .05.
The correlations between ASVAB predictors and JPMS measures are reported in Tables 8 and 9.
Scores for the JPMS factors were obtained by a weighted average of the marker variable scores.
All the marker variables had a weight of 1.0, except the hands-on measure. It was weighted 2.0
because of its greater importance in defining the factor of technical proficiency.

The ASVAB modestly predicted the JPMS measures. The corrected and uncorrected correlations
suggest that the Mechanical and Electronics composites, as well as the 10 subtests, are most
predictive of the JPMS measures. Furthermore, the magnitude of the Electronics and subtest
correlations relative to those of the Mechanical composite suggests that the ASVAB has additional
predictive power that could be used to classify personnel to the Jet Engine Mechanic specialty.


Table 8. Correlations Between ASVAB Predictors and
JPMS Measures (Uncorrected for Restriction)

ASVAB predictors (column headings): AFQT, M, A, G, E, Subtests^a

[Only one value per row is legible for the rows below; the predictor column is not identified.]

JPMS measures

INPERS         .107
TOTWTPT        .150
Hands-on       .120
Interview      .187*
TGRD           .477**

Note. Abbreviations are AFQT = Armed Forces Qualification Test; M = Mechanical;
A = Administrative; G = General; and E = Electronics.
The abbreviation TOTWTPT is for the total score obtained with
Walk-Through Performance Testing and TGRD is the grade received in
technical training. See Tables 1 and 3 for the remaining abbreviations.
^aThe values reported for the subtests are multiple correlations.

*p < .05.

**p < .01.
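
The factor scoring rule described before Table 8 (every marker variable weighted 1.0, the hands-on measure weighted 2.0) amounts to a weighted average. The sketch below is illustrative only: the marker names and score values are hypothetical, and it assumes the marker scores are already on a common scale, which the text does not state.

```python
def factor_score(marker_scores, weights=None):
    """Weighted average of marker-variable scores for one factor."""
    if weights is None:
        weights = {}
    total = 0.0
    weight_sum = 0.0
    for name, score in marker_scores.items():
        w = weights.get(name, 1.0)  # every marker defaults to a weight of 1.0
        total += w * score
        weight_sum += w
    return total / weight_sum


# Hypothetical scores for one incumbent on three markers of the TECH factor.
tech_markers = {"hands_on": 1.0, "interview": 0.0, "t_supervisor": -0.5}
tech_score = factor_score(tech_markers, weights={"hands_on": 2.0})
print(tech_score)
```

Doubling the hands-on weight pulls the factor score toward the hands-on result without changing the scale of the score.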

The ASVAB was a somewhat better predictor of the interviewing measure than of the hands-on
measure. A probable explanation for this finding is a common requirement of verbal ability for
ASVAB predictors and the interviewing measure. The interview requires the incumbent to "show and
tell" how to do the task, whereas the hands-on measure requires the incumbent "to do" the task.

The ASVAB did better in predicting training school grades than it did in predicting JPMS
measures. Historically, the ASVAB has been revised on the basis of its relationship to training
school grades; so, this finding is not surprising. However, it is important to assess whether
the ASVAB has any predictive efficiency that is unique to measures of on-the-job performance.

Finally, a comparison of uncorrected and corrected correlations in Tables 8 and 9 indicates
that several correlations were reduced in magnitude upon correction. These reductions were
apparently due to greater standard deviations for some of the ASVAB subtests in the jet engine
mechanic data compared to Youth Population data.
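
The direction of this effect is easiest to see in the single-predictor case. The sketch below uses the familiar univariate correction for direct range restriction (often called Thorndike's Case 2) rather than the multivariate procedure used in the study, and the standard deviations are hypothetical. When the sample standard deviation exceeds the population value, the "correction" shrinks the correlation.

```python
import math


def correct_for_range_restriction(r, sd_sample, sd_population):
    """Univariate correction of a correlation for direct selection on x."""
    u = sd_population / sd_sample
    return r * u / math.sqrt(1.0 - r * r + r * r * u * u)


r = 0.30
# Sample less variable than the population: the correlation increases.
restricted = correct_for_range_restriction(r, sd_sample=5.0, sd_population=10.0)
# Sample more variable than the population: the correlation decreases.
enlarged = correct_for_range_restriction(r, sd_sample=10.0, sd_population=8.0)
print(round(restricted, 3), round(enlarged, 3))
```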




Table 9. Correlations Between ASVAB Predictors
and JPMS Measures (Corrected for Restriction)

                                  ASVAB predictors
JPMS measures   AFQT      M         A         G         E         Subtests^a

TECH            .219      .244*     .159      .220      .237*     .306**
SUPER           .175      .152      .215      .166      .188      .289
SELF           -.134     -.051     -.191     -.153     -.089      .357*
PEER            .324      .232      .350      .300      .292      .368
INPERS          .211      .135      .190      .180      .173      .235
TOTWTPT         .232      .245*     .193      .231      .244*     .307**
Hands-on        .204      .213      .159      .188      .198      .292*
Interview       .180*     .224*     .111      .209*     .232**    .276*
TGRD            .621**    .543**    .582**    .620**    .627**    .671**

Note. Abbreviations are defined in earlier tables. Statistical
significance reported for uncorrected correlations in Table 4 is inferred
for correlations in this table.

^aThe values reported for the subtests are multiple correlations.

*p < .05.

**p < .01.

Several sets of multivariate regression analyses, as reported in Tables 10 and 11, were
performed to assess the unique predictive efficiency of the ASVAB subtests. The WTPT measures
were analyzed separately and as a composite. In addition, two orders of entry were specified for
Roy-Bargman step-down tests (Bock, 1975). One order addressed the question: Does the ASVAB
predict performance in training that is uniquely different from on-the-job performance? Sets 1
and 2 addressed this question. The second order addressed the question: Does the ASVAB predict
job performance uniquely beyond that of training performance? Sets 3 and 4 addressed this
question.
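
The logic of a step-down test can be sketched as a sequence of regressions: each criterion is tested on the predictors after the criteria entered earlier have been added as covariates. The code below is a minimal illustration assuming NumPy and synthetic data; it is not the program used in the study. The sample size of 189 is an assumption chosen only because, with 10 predictors and an intercept, it reproduces the error degrees of freedom shown in Tables 10 and 11 (178, 177, 176).

```python
import numpy as np


def sse(y, design):
    """Residual sum of squares from an ordinary least-squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)


def step_down_tests(predictors, criteria):
    """Step-down F tests for criteria taken in a fixed order of entry.

    predictors : (n, q) array of predictor scores
    criteria   : list of (n,) arrays in the chosen order of entry
    Returns one (df_hyp, df_err, F) tuple per criterion.
    """
    n, q = predictors.shape
    covariates = np.ones((n, 1))  # intercept only at the first step
    results = []
    for y in criteria:
        full = np.hstack([covariates, predictors])
        sse_reduced = sse(y, covariates)  # covariates only, no predictors
        sse_full = sse(y, full)
        df_err = n - full.shape[1]
        f_ratio = ((sse_reduced - sse_full) / q) / (sse_full / df_err)
        results.append((q, df_err, f_ratio))
        # The criterion just tested becomes a covariate for later steps.
        covariates = np.hstack([covariates, y[:, None]])
    return results


# Synthetic data only (not the study data).
rng = np.random.default_rng(0)
x = rng.normal(size=(189, 10))
job = 0.2 * x[:, 0] + rng.normal(size=189)
grade = 0.6 * x[:, 1] + 0.3 * job + rng.normal(size=189)
for df_hyp, df_err, f in step_down_tests(x, [job, grade]):
    print(df_hyp, df_err, round(f, 2))
```

Reversing the order of the list passed as `criteria` corresponds to the second order of entry, in which training performance is entered first and job performance is tested after adjusting for it.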

Table 10. Summary of Roy-Bargman Step-Down Tests
(Uncorrected for Restriction)

     Analysis            df           Mean squares
Set  Order          HYP    MSE     HYP       MSE       F-ratio

1    TOTWTPT        10     178     437.86    222.81     1.96*
     TGRD           10     177     258.50     29.33     8.81**
2    Hands-on       10     178     196.42    104.90     1.87
     Interview      10     177     104.51    107.07      .98
     TGRD           10     176     259.73     29.22     8.88**
3    TGRD           10     178     283.80     30.06     9.44**
     TOTWTPT        10     177     333.98    217.45     1.53
4    TGRD           10     178     283.80     30.06     9.44**
     Hands-on       10     177     177.38    101.42     1.75
     Interview      10     176      82.70    107.67      .77

Note. Abbreviations are HYP = Hypothesis and MSE = Mean square
error. The remaining abbreviations are defined in earlier tables.

*p < .05.

**p < .01.

As shown by the step-down tests in sets 1 and 2, the training performance predicted by ASVAB
is uniquely different from on-the-job performance. The step-down tests shown for sets 3 and 4
suggest that the ASVAB does not predict unique on-the-job performance beyond that accounted for
in training grades. These results occurred for both corrected and uncorrected correlations.
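
Each F-ratio in Table 10 is the hypothesis mean square divided by the error mean square. As a small worked check, the sketch below recomputes the ratios from the tabled mean squares; the values are transcribed from Table 10, and the helper name is hypothetical.

```python
# (set, criterion, MS hypothesis, MS error, reported F) from Table 10.
table_10 = [
    (1, "TOTWTPT", 437.86, 222.81, 1.96),
    (1, "TGRD", 258.50, 29.33, 8.81),
    (2, "Hands-on", 196.42, 104.90, 1.87),
    (2, "Interview", 104.51, 107.07, 0.98),
    (2, "TGRD", 259.73, 29.22, 8.88),
    (3, "TGRD", 283.80, 30.06, 9.44),
    (3, "TOTWTPT", 333.98, 217.45, 1.53),
    (4, "TGRD", 283.80, 30.06, 9.44),
    (4, "Hands-on", 177.38, 101.42, 1.75),
    (4, "Interview", 82.70, 107.67, 0.77),
]


def f_ratio(ms_hyp, ms_err):
    """F statistic as the ratio of hypothesis to error mean squares."""
    return ms_hyp / ms_err


for set_no, criterion, ms_hyp, ms_err, reported in table_10:
    print(set_no, criterion, round(f_ratio(ms_hyp, ms_err), 3), reported)
```

The recomputed ratios agree with the reported values to within rounding. In sets 3 and 4, the ratios for the WTPT measures stay close to 1 once training grade has been entered first, which is the pattern the paragraph above describes.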


Table 11. Summary of Roy-Bargman Step-Down Tests
(Corrected for Restriction)

     Analysis            df           Mean squares
Set  Order          HYP    MSE     HYP       MSE       F-ratio

1    TOTWTPT        10     178     422.51    227.72     1.86*
     TGRD           10     177     366.23     28.47    12.86**
2    Hands-on       10     178     174.98    105.75     1.65
     Interview      10     177      84.84    105.18      .81
     TGRD           10     176     377.10     28.17    13.38**
3    TGRD           10     178     421.91     29.00    14.55**
     TOTWTPT        10     177     194.44    223.60      .87
4    TGRD           10     178     421.91     29.00    14.55**
     Hands-on       10     177     120.64    105.78     1.14
     Interview      10     176      59.17    102.78      .58

Note. The abbreviations are defined in earlier tables. Statistical
significance reported for uncorrected tests in Table 10 is inferred for
tests in this table.

*p < .05.

**p < .01.


DISCUSSION 


This paper has presented an investigation into the structure of the JPMS. The results were
encouraging, in that five of the six hypothesized factors were obtained. The WTPT factor did
not emerge, but it was part of the hypothesized factor of
technical proficiency. This interpretation is supported by the loadings of the rating measures
on the technical proficiency factor. Substantial loadings were obtained for all the rating
measures that tapped technical proficiency (factors TECH, SUPER, SELF, and PEER in Table 1),
while near-zero loadings were obtained for the rating measures that tapped interpersonal
proficiency (factor INPERS in Table 1).


The interpersonal factor also appeared as hypothesized. Although the factor loadings of
marker variables were smaller in magnitude than those for technical proficiency, a common
interpersonal proficiency factor was found in ratings by incumbents, supervisors, and peers.
Perhaps this factor will appear stronger in the application of the JPMS to Air Force specialties
that are service-oriented (e.g., Personnel Technician).


The meaning of the source-of-rating factors remains a topic for future research. The
loadings of the self, supervisory, and peer factors suggest that both technical and interpersonal
aspects of the work are being tapped. Of course, the factors may indicate nothing more than
unique sources of bias. If so, the general lack of predictive efficiency of the ASVAB for these
factors is understandable.

This paper also investigated the ability of the ASVAB to predict the JPMS measures. In
general, the results indicated modest success in predicting on-the-job performance. Only the
WTPT measures appeared to be predicted by the ASVAB.

Of the ASVAB composites, the Mechanical and Electronics composites appeared to be most valid.
This occurred for both uncorrected and corrected correlations, which lends some confidence to
the use of corrected correlations. However, until JPMS results from other specialties are
available, the interpretation of corrected correlations must remain cautious.

In conclusion, the results of this paper indicate that (a) JPMS has a meaningful structure
which is quite similar to the hypothesized structure, and (b) ASVAB predictors are only modestly
predictive of the on-the-job performance of jet engine mechanics.


IV. AIR FORCE JOB PERFORMANCE MEASUREMENT TECHNOLOGY
APPLIED TO TRAINING


Jack  L.  Blackhurst 
Rodger D. Ballentine
Martin  W.  Pellum 

Air  Force  Human  Resources  Laboratory 

OVERVIEW 

All Air Force training programs are evaluated in some way, whether by asking trainees
their opinions of the training or by assessing the effect of the training on some work-related
variables. Since most training is aimed at affecting job performance in some way, obtaining
indices of job performance is critical to both the initial design of the training program and
the subsequent evaluation of the program's success. Resident technical training schools rely on
written and hands-on tests to determine how well the training objectives are being met. These
schools also depend on inputs from supervisors of recent graduates to ensure the curricula are
preparing the students to effectively meet job demands. Supervisors and inspection teams use
performance appraisals in making decisions about trainee proficiency and training program
soundness. This paper addresses how performance measurement is currently utilized in resident
technical training and in on-the-job training, and how advances in job performance measurement
could aid in these endeavors. (Although the technical training comments in this paper focus on
initial training, the concepts apply equally well to later phases of resident technical training.)


PERFORMANCE  MEASUREMENT  IN  TRAINING  TODAY 
Initial  Resident  Technical  Training 


Initial resident technical training closely parallels the educational environment of civilian
vocational schools. Courses are structured in their length, presentation methods, and evaluation
processes. Students learn in both classroom and laboratory settings. The training curriculum is
derived directly from the Specialty Training Standard (STS) using task analysis and Instructional
Systems Development practices (AFM 50-2; AFP 50-58; DeVries, Eschenbrenner, & Ruck, 1980) and is
intended to prepare students for a variety of potential jobs within a specialty (ATCR 52-3).
Thus, the training is typically aimed at producing graduates who are semi-skilled or "partially
proficient" on many tasks, rather than skilled or "proficient" on a few tasks. This fact
introduces some unique considerations when applying job performance measurement technologies,
which will be discussed later in this paper.

In resident technical training, performance appraisal plays a key role in evaluating success
of both the students and the training program. The primary purpose of student performance
measurement within the resident school is to ensure the training objectives are being met so the
student may proceed through the course or graduate. However, such information is also used to
evaluate the quality of the training itself.

Student accomplishment of each course objective is formally assessed via written and/or
hands-on performance tests. Such checks typically involve demonstrating skills and knowledge
required for task performance to the "partially proficient" level. For example, with hands-on
performance tests, students may receive assistance from instructors or they may perform only the
less difficult parts of the task. When tasks require a team effort, students may not have the
opportunity to perform all aspects of a task, and therefore, evaluation reflects their team
participation more than their individual performance.

One informal assessment technique used throughout resident training is the verbal quiz.
During class discussion or task performance, students explain where they would go for
information, what they are doing, why they are doing it, etc. This informal evaluation seems to
be a powerful measure of student understanding of the underlying systems concepts.

Evaluation of resident technical training programs also derives from appraisals of graduate
performance on the job through supervisor questionnaires, training quality reports (TQRs), and
training center field visits (ATCR 52-12). About 3 months after completing technical school,
graduates and their supervisors complete rating forms covering how well the training program
prepared the graduates for duty, using the STS task listing as a guide. The TQR program enables
anyone to express an observation concerning training program results. Usually this is a
perceived training deficiency in graduates, either where STS standards are not being met or the
training is inadequate given on-the-job requirements. Finally, field visits are conducted by a
training evaluation team in which recent graduates are interviewed and often asked to perform
certain tasks taught in the resident course. All of this external performance information is fed
back to the resident training managers for consideration and action, if necessary. It is also
combined with other (internal) training information (test scores, washback rates, etc.) and
compiled into a training evaluation report (AFR 50-38).


On-the-Job Training

The on-the-job training (OJT) component is considerably different from resident technical
training. OJT is "dual-channeled" in that individuals receive training on their entire specialty
through career development courses (CDCs), and training on their particular jobs through
interactions with supervisors/trainers. The training occurs in the operational environment where
newly assigned trainees learn a job and contribute to the mission at the same time. This places
different constraints on OJT than those experienced in initial technical training; as job
pressures increase, training may suffer. Though guidance on how to conduct OJT is outlined in
several different places (e.g., AFR 50-23, AFM 66-1), OJT remains a flexible, albeit unstructured,
process. The STS serves primarily as a guide for the level of proficiency required for task
certification, while the supervisor has discretion over what is trained. (In some cases, major
commands may require that certain tasks be included in this training by issuing command-level job
performance guides.)

Typically, OJT is administered in a one-on-one fashion, with an expert training a novice.
Generally, there are limited instructional materials, and training practices may vary greatly
from one supervisor/trainer to another. Performance evaluation practices may vary greatly as
well. Prior to certifying a trainee on a task, the supervisor must ensure the trainee is able to
perform at the given level of proficiency. The supervisor may evaluate trainee job performance
any number of ways, from hands-on demonstrations to inspections of end products. CDC performance
is measured by written volume review exercises and course examinations.

Performance appraisals are also used by base and major command inspection teams. Individuals
are selected to perform certain tasks, and their performance is evaluated against the standards
found in the applicable regulations and technical orders. These individual evaluations could be
used as a check on the training system as a whole.


APPLYING  PERFORMANCE  MEASUREMENT  TECHNOLOGY  TO  TRAINING 
Background 

In 1983, the Air Training Command formally expressed the need for research linking resident
technical training to individual job performance. Two separate research efforts exploring the
potential integration of performance measurement technologies in the technical training
evaluation process have been completed since that time (Banks, 1987; Driskill, Mitchell, &
Ballentine, 1985). As a result of these two studies, there exists a clearer notion of how and
where the job performance measurement process and information could benefit the technical
training community. Research on performance measurement in OJT arose from a different set of
circumstances. Following an AF-wide inspection of the entire OJT system in 1977, work was
initiated to improve all aspects of OJT through the application of state-of-the-art computer and
automation technology (Carson, Chambers, & Gosc, 1984). Within this paper, comments on how and
where performance measurement could benefit OJT will be confined to applications within the
Advanced On-the-job Training System (AOTS).


Initial Resident Technical Training


The resident technical training process can be divided, for discussion purposes, into three
phases. The nature of the performance information gathered and its use may differ depending on
which phase is targeted. Therefore, each phase will be treated separately, even though there are
obvious overlaps between them.


Phase I: Pre-Training

The processes that occur within this phase all relate to development of the training
program. This phase encompasses the identification of tasks performed by the target group
(usually first-term airmen), and the translation of this information into training objectives,
and ultimately lesson plans. In essence, it includes anything that occurs prior to the active
training of individuals.

Two areas in this phase could benefit from measures of job performance. First, the validity
of the Instructional Systems Development (ISD) model could be determined by empirically linking
the developed training curriculum to job performance. Though this feedback loop currently
exists, the methods used (i.e., surveys and interviews) have not been compared to more rigorous
methods such as performance tests. (See the Phase III discussion below.) Thus, one cannot be
sure of the quality of the information that is currently gathered. Second, the behaviors
associated with given levels of proficiency may or may not be the critical ones needed upon entry
on the job. Since not all aspects of task performance can be covered in resident training,
information is needed that would enable trainers to more precisely define the course content in
a manner that best meets the job requirements, supervisor expectations, and training
constraints. Job performance measures could give this entry-level task proficiency information.


Phase II: Training

This phase reflects the action of instructing, material presentation, involvement of the
trainees, and the methods used to assess trainee progress. Since much of the success of a
training program rests with the ability of the instructor to convey the material, research on
methods for "training the trainer" and evaluating instructor performance appears warranted.
Performance measurement development methods and formats from the emerging performance technology
could assist in this area. Student measurement is another area that could benefit from this
technology. Though the present internal training evaluation methods appear to work, their
validity and reliability are rarely assessed. Additional or different methods might yield
greater information or increase the validity of information collected, leading to better
administrative decisions.


Phase III: Post-Training

Determining whether the training program improved subsequent job performance is the key
function of this phase. It forms the external feedback loop needed to validate the training
process begun in Phase I. Several different approaches have been taken to empirically relate
what is trained to what needs to be trained. One could use surveys or subject-matter expert
(SME) judgments. Pennell, Harris, and Schwille (1976) developed and demonstrated a
factor-analysis-based methodology for using the existing supervisor surveys of recent resident
technical training graduates' performance. Based on supervisor ratings of performance, they
identified areas that were potentially being overtrained or undertrained. Ford and Wroten (1984)
achieved the same basic end through the calculation of content validity ratios (CVRs) using SME
inputs. They used these CVRs to match training needs to current training emphases at the
training category and subcategory levels.
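
For readers unfamiliar with the statistic, a content validity ratio can be sketched as follows. The formula shown is Lawshe's commonly used definition, which is assumed here; the text does not spell out the exact computation Ford and Wroten used, and the panel counts below are hypothetical.

```python
def content_validity_ratio(n_essential, n_panelists):
    """Lawshe-style CVR: (n_e - N/2) / (N/2).

    n_essential : number of SMEs rating the item as essential
    n_panelists : total number of SMEs on the panel
    Ranges from -1 (no one rates it essential) to +1 (everyone does).
    """
    half = n_panelists / 2.0
    return (n_essential - half) / half


# Hypothetical panel of 10 SMEs rating three training content areas.
for area, votes in [("area A", 9), ("area B", 5), ("area C", 1)]:
    print(area, content_validity_ratio(votes, 10))
```

A high ratio for a content area that receives little training time would point toward possible undertraining, and a low ratio for an area that receives heavy emphasis would point toward possible overtraining.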


Very little work exists that relates hands-on measures of job performance to training program
content or methods. For this reason, training course developers often lack the information
needed to make decisions on how the training curriculum can be structured to provide the greatest
transfer of learning to the job. One could obtain measures of productivity or effectiveness, such
as maintenance downtime or sortie flying rates, and relate these to the training. These
"production functions" would enable one to estimate the costs and benefits associated with
particular training program designs (Solomon, 1986). However, these measures are often difficult
to obtain, are affected by numerous other factors besides training, and have been shown to not
always be reliable (Gibson & Orlansky, 1986).


Another way to obtain the needed job performance information would be to develop tests of
tasks individuals perform in their jobs. The Air Force's Walk-Through Performance Testing (WTPT)
technology (Hedge & Teachout, 1986) has been used to obtain such measures of job performance for
selection validation purposes (Dickinson, Hedge, & Ballentine, 1987). This same information has
also been related to resident technical training achievement (Hedge, Ballentine, & Gould, 1985).
What is yet to be developed, however, is a WTPT-like system geared specifically toward
identifying training needs.


On-the-Job  Training 


It is conceivable that measures could be developed that could evaluate the training conducted
prior to an individual reaching his/her job and identify additional OJT training needs. If
designed for and administered to recent resident technical training graduates, the results of
such measures could be fed back to the training center and used by the supervisor to determine
the point at which to start the person in OJT. If the measures reflected tasks learned primarily
in OJT, the results could be used by an individual's new supervisor as a diagnostic tool in
identifying the extent of training necessary to make the person qualified for a position held.
Both of these concepts fit within the Advanced On-the-job Training System (AOTS) scenario
discussed below.


The  Advanced  On-the-Job  Training  System 

The AOTS will provide the capability of real-time management of all aspects of OJT to include
automatic record-keeping and updates, as well as scheduling and tracking trainee progress. In
addition, it will provide specialty training requirements at multiple levels (e.g., by position,
unit, or specialty). The system uses a computer management system with both training and
evaluation components (see Figure 3).

Figure 3. AOTS Subsystems.

The training and evaluation components will provide the Air Force with the capability of
on-line training and evaluation. SMEs will identify the task training requirements and actually
build the training and evaluation materials for supervisors, making it a real "blue-suit"
system. Training requirements can then be matched to a particular position, rather than to an
entire specialty as is currently done. The results of the evaluations will be updated
automatically or through optical scan devices. The system is designed to be available to the
supervisor and trainee in the work center, where OJT is accomplished today. The system should
allow training to occur without the need for a supervisor to be present for every task, since
computer-based training will be used for some tasks. For those critical tasks that need
one-on-one training, the supervisor should be able to spend more quality training time with the
trainee.


V. INTER-SERVICE TECHNOLOGY TRANSFER: PROMISE AND PAYOFF

Robert  E.  Duncan 
Dorian  A.  Hodge 
Jack  L.  Blackhurst 

Air  Force  Human  Resources  Laboratory 

As the lead Service for job performance measurement (JPM) technology transfer, the Air Force
has initiated a variety of efforts to investigate the feasibility not only of sharing previously
developed measures but also of sharing developmental approaches. In the past, the major
rationale behind technology transfer has been cost savings (Alba, 1986, p. 1). Today, that
rationale remains. However, as we look toward the future of performance measurement technology,
greater emphasis needs to be given to developing and executing a plan that expands the rationale
for the transfer of performance measurement technology beyond cost savings/cost avoidance. The
transfer of previously developed instruments has great potential, but the greater promise and
payoff lie in transferring Service-specific measurement approaches to provide optimum validation
of Armed Services Vocational Aptitude Battery (ASVAB) scores.

This paper will examine: (a) the scope of technology transfer; (b) past, present, and future
transfer activities; and (c) the promise of future JPM technology transfer and its potential
payoff to the Department of Defense (DoD) and private industry.


BACKGROUND 

The DoD, in response to a Congressional mandate, is coordinating an effort among the Services
to develop job performance measures for the validation of the ASVAB. This effort is being directed
by the Joint-Service Job Performance Measurement Working Group (JSJPMWG), composed of Service
representatives and a representative from the National Academy of Sciences, the organization
responsible for providing technical oversight and advice. Early in the effort, the JSJPMWG
determined that JPM technology needed to be shared (transferred) among the Services
to the maximum extent possible. As a result, the Air Force was selected as the lead Service for
technology transfer. The transfer efforts, initiated by the Air Force in 1985, are ongoing and
are discussed more thoroughly later in this paper.


THE  SCOPE  OF  TECHNOLOGY  TRANSFER 

JPM technology transfer can be described as a two-pronged effort. We call the first
component "technology transfer," where technology refers to the procedures the Services use in
developing their respective hands-on and surrogate measures. In some Services these procedures
are well documented, while in others the documentation is currently being prepared or is entirely
absent. Although these last two conditions make technology transfer more difficult, it is
important to pursue the transfer of Service-specific approaches to determine where modifications
need to be made. The approach being followed to transfer Service-specific technology to other
Services involves active, cooperative interaction between the donating and receiving
Services.

Modifying task sampling and analysis techniques where required, the receiving
Service follows the specific procedures used by the donating Service to develop the latter's
surrogate measures (transfer, at this time, is centered on surrogates). Such a procedure
ensures that the receiving Service develops a surrogate measure that exactly parallels
the surrogate developed by the donating Service. Technology transfer, as described here, will
yield the greatest long-term payoff: a series of JPM technologies usable not only by the Services
but also by the private sector. As a receiver of technology, the Air Force is attempting to
document the development of another Service's surrogate in an ongoing transfer effort. Such
action is required to design a truly uniform Joint-Service JPM technology.

The second component of transfer involves what we call "transfer-in-kind." Transfer-in-kind
is simply the application, with minor modifications of nomenclature, of the performance measures
developed for a specific specialty of the donating Service to a very similar specialty of the
receiving Service. The major benefit of this type of transfer is short-term cost avoidance
(i.e., avoiding the costs associated with the full development of performance measures in similar
specialties). Past transfer efforts, to be described later, have focused on this type of
transfer in order to: (a) demonstrate the feasibility of inter-Service JPM technology transfer,
and (b) save developmental costs in a time of austere funding. While the feasibility question
has been answered in the affirmative, the lack of procedural documentation severely limits
long-term use of the transferred technologies.


EXAMPLES  OF  TECHNOLOGY  TRANSFER 


Transfer  of  Jet  Engine  Mechanic  Instruments  to  the  Navy 


Background 


In 1985, an effort was begun to transfer the Air Force jet engine mechanic (JEM) Job
Performance Measurement System (JPMS) to the Navy. This transfer effort is thoroughly documented
in Alba (1986), Baker and Blackhurst (1986), and Blackhurst and Baker (1985). However, a brief
review of the purpose, approach, and cost/benefit analyses is provided here. The JEM technology
transfer was a "transfer-in-kind." Because the Air Force provided its JPMS instruments to the
Navy/Marine Corps for use with their J-79 JEMs, the Navy was able to speed up the cradle-to-grave
(from job domain definition to data analysis) process.


Approach 

The Air Force provided the Navy with a completed JPMS for J-79 JEMs. The Air Force JPMS is
composed of a Walk-Through Performance Test (WTPT), four types of rating forms, four
questionnaires seeking attitudinal and motivational information, and instructions for the
administration of all instruments. The WTPT contains both hands-on and interview items.
Hands-on items are designed to permit a trained administrator to observe the performance of a job
incumbent on specific tasks or parts of tasks which are part of a normal first-termer's job.
Interview items require the incumbent to describe how a task should be accomplished (in a
show-and-tell approach).

A representative set of items is presented in both hands-on and interview modes to permit
comparison between the hands-on benchmark and the interview surrogate. Rating forms include:
(a) a global rating of both technical and interpersonal skills, (b) a rating form which examines
military-related (as separate from job-related) performance (e.g., leadership, integrity), (c) a
rating of all tasks reflected in the WTPT at the task level, and (d) a rating of tasks grouped
into logical dimensions (e.g., removing and replacing components, troubleshooting). These
different rating forms are administered to a job incumbent, the incumbent's coworkers, and the
incumbent's supervisor, after all raters have undergone an extensive 4-hour training session.
The JPMS is used to validate the ASVAB selector aptitude index (mechanical, administrative,
general, and electronics) for a specialty. The Air Force-developed JPMS was evaluated by Navy,
Marine Corps, and contractor personnel to ensure that the measurement system adequately measured
Navy/Marine Corps J-79 JEM performance. After review, minor revisions were made to adapt the
instruments to reflect terminology commonly used in the Navy/Marine Corps. Additionally, two
tasks not performed by Navy/Marine Corps JEMs were removed, and tasks frequently performed by
these personnel were substituted. Due to an insufficient number of Navy JEMs, the revised
instruments were administered to first-term Marine J-79 mechanics. The resulting data are
currently being compiled. In addition to the revised Air Force instruments, a Navy job knowledge
test was developed and administered to the Marine J-79 incumbents.


Cost  Avoidance  as  a  Result  of  Transfer-In-Kind 

Baker and Blackhurst (1986) compared the costs associated with development of a JEM JPMS in
the Air Force to the costs incurred as a result of a transfer-in-kind to the Navy/Marine Corps.
Table 12, reproduced from Baker and Blackhurst (1986), illustrates the results of this
comparison. A complete verification of these costs was not possible because the necessary
reference documents were unavailable. Regardless, it can be seen that the transfer-in-kind to the Navy
yielded an estimated 400% cost avoidance. However, the size of future cost avoidance figures may
differ from that obtained by Baker and Blackhurst.


Table 12. Jet Engine Mechanics JPMS Transfer Cost Comparison

               Air Force Development      Transfer to Navy
               Man-hours     Dollars      Man-hours    Dollars

Active Duty        3,797    $ 27,000            260    $ 5,200
Civilian           2,509      80,300            102      2,300
Contractor         4,164     143,000          1,860     46,000

Total             10,470    $250,300          2,222    $53,500
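
The arithmetic behind Table 12 can be checked with a short sketch. The figures below are transcribed from the table; the function names and the choice of summary statistics are illustrative assumptions, since this paper does not state how the cost avoidance percentage was derived. The ratio of development cost to transfer cost is one plausible reading.

```python
# Figures transcribed from Table 12 (Baker and Blackhurst, 1986).
# Each entry is (man-hours, dollars). Names here are illustrative only.
AIR_FORCE_DEVELOPMENT = {
    "Active Duty": (3797, 27000),
    "Civilian": (2509, 80300),
    "Contractor": (4164, 143000),
}
TRANSFER_TO_NAVY = {
    "Active Duty": (260, 5200),
    "Civilian": (102, 2300),
    "Contractor": (1860, 46000),
}


def totals(costs):
    """Sum man-hours and dollars across personnel categories."""
    hours = sum(h for h, _ in costs.values())
    dollars = sum(d for _, d in costs.values())
    return hours, dollars


def cost_summary(development, transfer):
    """Compare full development against transfer-in-kind.

    Assumption: 'avoidance' is read as development cost minus transfer
    cost, and the ratio as development cost divided by transfer cost.
    """
    dev_hours, dev_dollars = totals(development)
    xfer_hours, xfer_dollars = totals(transfer)
    return {
        "dollars_avoided": dev_dollars - xfer_dollars,
        "hours_avoided": dev_hours - xfer_hours,
        "dollar_ratio": dev_dollars / xfer_dollars,
        "hour_ratio": dev_hours / xfer_hours,
    }


summary = cost_summary(AIR_FORCE_DEVELOPMENT, TRANSFER_TO_NAVY)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the column totals agree with the Total row of the table, and full development costs between four and five times as much as the transfer in both dollars and man-hours.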

Conclusions 

Although the final results of this transfer-in-kind effort are not yet in, the work completed
so far indicates that this method of transfer is not only feasible but also accelerates
cross-Service developmental efforts and avoids the costs associated with simultaneous development
of separate JPM instruments for similar Service specialties. Transfer-in-kind, however, should
not be limited to those specifically interested in JPM research across the Services. Each
Service may have needs in its operational communities for instruments developed in
the Joint-Service JPM project. Private industry may also need these same instruments but may
lack the expertise for revision/development of performance measures. The Services should make an
effort to provide industry with completed instruments to determine whether the measurement
approach is usable outside a military framework.


Upon recommendation of the National Academy of Sciences (NAS), the Air Force is developing
Army Job Knowledge Tests for four Air Force specialties. This effort represents the first type
of transfer, the transfer of technology, and as such requires that great care be taken to ensure
Air Force test developers understand the concepts which underlie the procedures used by Army
researchers to develop Army Job Knowledge Tests. This effort is facilitated by ongoing Air Force
efforts to develop JPMSs for four Air Force specialties. Development of Job Knowledge Tests will
occur simultaneously with Air Force JPMS development for these four specialties.

The approach being taken to develop Job Knowledge Tests includes consultation with Army
representatives throughout development. Job Knowledge Test developers working for the Air Force
will initially review documents provided by the Army that outline test content selection and test
development procedures. In addition, test developers will review the test content resulting from
Air Force JPMS development. After background information has been reviewed, test developers will
meet with Army researchers to discuss all aspects of Job Knowledge Test development, including
task sampling procedures, test domain definition, test development, and methodological
requirements. Once the instruments have been developed, test developers will again confer with
Army researchers and, based upon their input, make any revisions necessary to ensure the accuracy
of the Job Knowledge Tests as job performance measures. After development of the Job Knowledge
Tests for the four designated Air Force specialties has been completed, the data collection and
data analysis phases will begin. Approximately 250 subjects in each of the four career fields
will be administered the Job Knowledge Tests along with the Air Force JPMS and a battery of
additional predictors which include the Air Force Apprentice Knowledge Test for each career field
and the Space Perception Test. A counterbalancing technique for test administration will be
developed to guard against bias resulting from test administration order. Data analysis will
examine, among other things, test reliability, validity, and the fidelity of each surrogate job
performance measure when compared to the hands-on benchmark. The JSJPMWG will assist in
developing a data analysis strategy that thoroughly assesses the potential utility and the most
advantageous applications of surrogate job performance measures in future job performance
measurement research. JSJPMWG assistance during both test development and data analysis will
ensure a forward-looking perspective for technology transfer among the Services.
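
The counterbalancing scheme mentioned above had not yet been developed when this paper was written, so its actual design is unknown. As a minimal sketch of one common option, a cyclic Latin square rotates the administration order so that each instrument occupies each serial position equally often. Treating the four instruments as four blocks, the subject identifiers, and the function names are all assumptions made for illustration.

```python
def cyclic_latin_square(instruments):
    """Return one administration order per row.

    Across the rows, each instrument occupies every serial position
    exactly once. Note that a cyclic square balances position only;
    it does not balance which instrument immediately precedes which.
    """
    n = len(instruments)
    return [
        [instruments[(row + pos) % n] for pos in range(n)]
        for row in range(n)
    ]


def assign_orders(subject_ids, instruments):
    """Rotate subjects through the rows so each order is used about equally."""
    square = cyclic_latin_square(instruments)
    return {
        sid: square[i % len(square)]
        for i, sid in enumerate(subject_ids)
    }


# Instrument names are taken from the text; grouping them this way is assumed.
INSTRUMENTS = [
    "Job Knowledge Test",
    "Air Force JPMS",
    "Apprentice Knowledge Test",
    "Space Perception Test",
]

# Hypothetical subject identifiers for one career field of about 250 people.
orders = assign_orders(range(250), INSTRUMENTS)
print(orders[0])
print(orders[1])
```

With 250 subjects and four orders, the group sizes differ by at most one subject. If carryover from one instrument to the next were a concern, a design that also balances immediate sequences would be needed instead.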


Anticipated  Results 

The most important result from this effort will be a demonstration of technology transfer.
Technology transfer has not yet been attempted by the Services and may prove more difficult than
transfer-in-kind because it will involve transfer of abstract concepts (procedures, methods, and
approaches), rather than concrete information (instruments already developed). Even with the
Army consulting throughout the development process, it is important to consider whether it is
possible to transfer technology from one Service to another and still maintain the uniqueness of
the donating Service's technology.

Several things may impact this transfer. First, job performance measurement technology can
be used within each Service for additional payoffs, although each of the Services has as its
overall objective the validation of enlistment standards under the Congressional mandate. For
example, the Air Force plans to use job performance measurement technology extensively in the
area of training. Additional within-Service applications of job performance measurement
technology may overlap among the Services, but at the same time stem from Service-unique research
requirements.

Second, inter-Service differences in test content definition, task sampling procedures, etc.,
may impact technology transfer, since each Service is developing both hands-on measures and
Service-unique surrogate measures. However, we are looking specifically at methods of technology
transfer which will transcend these differences.


If successful, this transfer may facilitate and promote future job performance measurement
technology transfers among the Services. An important product resulting from transfer of the
Army technology to the Air Force will be a report documenting procedures followed, problems

This paper also investigated the ability of the ASVAB to predict the JPMS measures. In
general, the results indicated modest success in predicting on-the-job performance. Only the
WTPT measures appeared to be predicted by the ASVAB.

Of the ASVAB composites, the Mechanical and Electronics composites appeared to be the most valid. This
held for both uncorrected and corrected correlations, which lent some confidence to the use
of corrected correlations. However, until JPMS results from other specialties are available, the
interpretation of corrected correlations must remain cautious.
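
This section does not say which correction was applied to the correlations. As an illustration only, the sketch below implements one widely used correction for direct range restriction on the predictor (Thorndike's Case II), which estimates the correlation in the unrestricted applicant population from the correlation observed in a selected group. The function name and the numeric values are hypothetical.

```python
import math


def correct_for_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Thorndike Case II correction for direct range restriction.

    r_restricted    : correlation observed in the selected (restricted) group
    sd_unrestricted : predictor standard deviation in the full population
    sd_restricted   : predictor standard deviation in the selected group
    """
    if not -1.0 <= r_restricted <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted
    ru = r_restricted * u
    return ru / math.sqrt(1.0 - r_restricted ** 2 + ru ** 2)


# Hypothetical values, not results from the JPMS data.
observed = 0.30
corrected = correct_for_range_restriction(observed, sd_unrestricted=10.0, sd_restricted=5.0)
print(f"observed r = {observed:.2f}, corrected r = {corrected:.2f}")
```

Because the corrected value depends heavily on the assumed ratio of standard deviations, the caution urged above is well founded: a modest observed correlation can grow substantially after correction.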

In conclusion, the results of this paper indicate that (a) the JPMS has a meaningful structure
which is quite similar to the hypothesized structure, and (b) ASVAB predictors are only modestly
predictive of the on-the-job performance of jet engine mechanics.


Training Evaluation


Job performance measurement has a clear application to training evaluation. The Air
Force currently uses a variety of performance measures to assess the effectiveness of its
training Air Force-wide. These measures vary from hands-on measurement to surveys or interviews
with students and trainees. The most common type of performance measurement is a
paper-and-pencil knowledge test. However, in field settings, the actual hands-on evaluation is
more commonly used. Both of these techniques are widely used in the DoD performance measurement
research program, and the findings should enhance training evaluation policy and
procedures in the Air Force. Two Air Force training research projects, the Advanced Training
System (for technical schools) and the Advanced On-the-Job Training System (for on-the-job
training), will have performance evaluation components which will use a variety of performance
measurement techniques. Procedures and lessons learned from the DoD effort can and will be applied to the
training evaluation area.


Transfer  to  the  Private  Sector 


One of the greatest payoffs of inter-Service technology transfer is providing both the
technology and developed instruments (transfer-in-kind) to the private sector. Based on the
successes achieved so far and planned future endeavors, there should be a substantial pool of
inter-Service JPM technology which, if shared with industry, would significantly aid the DoD in
establishing performance requirements for its contractors. This should enhance productivity and
product quality throughout weapon system acquisition activities. In the future, the Air Force
will be developing a JPMS for firefighters and a medical specialty. Not only will the other
Services benefit from these developmental efforts, but DoD or Service contractor organizations in
these specific areas could also benefit, by knowing what is expected of their personnel and by
being able to better select applicants. Another possible transfer involves the use of jet engine
mechanic measures by such aircraft manufacturers as Boeing and McDonnell Douglas, as well as the
airlines. The list of companies that could benefit from job performance measurement
technology transfer is potentially very long.

Without spending time discussing the implications of the Uniform Guidelines on Employee
Selection Procedures (1978), suffice it to say that these Guidelines require employers to show a
direct link between selection instruments and actual job performance. Interestingly, this is the
primary focus of the Joint-Service JPM project. Private sector use of Service-developed
instruments and technology will have an immeasurable positive impact on industry's selection,
classification, and training decisions.


CONCLUSION 

Inter-Service technology transfer has made many promises and provided many payoffs in a short
period of time. These payoffs, however, are only a beginning. The transfer of technology, both
procedural transfer and transfer-in-kind, has great potential among the Services as well as
between the DoD and private industry. Past and current transfer efforts have demonstrated, and will continue
to demonstrate, cost savings/cost avoidance. Technology transfer is a reality, but to fully
realize the multitude of possible payoffs, we must all march to the beat of the same drummer.